HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

Event-based cameras have recently drawn the attention of the Computer Vision community thanks to their advantages in terms of high temporal resolution, low power consumption and high dynamic range, compared to traditional frame-based cameras. These properties make event-based cameras an ideal choice for autonomous vehicles, robot navigation or UAV vision, among others. However, the accuracy of event-based object classification algorithms, which is of crucial importance for any reliable system working in real-world conditions, is still far behind their frame-based counterparts. Two main reasons for this performance gap are: 1. The lack of effective low-level representations and architectures for event-based object classification and 2. The absence of large real-world event-based datasets. In this paper we address both problems. First, we introduce a novel event-based feature representation together with a new machine learning architecture. Compared to previous approaches, we use local memory units to efficiently leverage past temporal information and build a robust event-based representation. Second, we release the first large real-world event-based dataset for object classification. We compare our method to the state-of-the-art with extensive experiments, showing better classification performance and real-time computation.


Introduction
This paper focuses on the problem of object classification using the output of a neuromorphic asynchronous event-based camera [15,14,53]. Event-based cameras offer a novel path to Computer Vision by introducing a fundamentally new representation of visual scenes, with a drive towards real-time and low-power algorithms.
Contrary to standard frame-based cameras, which rely on a pre-defined acquisition rate, in event-based cameras individual pixels asynchronously emit events when they observe a sufficient change of the local illuminance intensity (Figure 1). This new principle leads to a significant reduction of memory usage and power consumption: the information contained in standard videos of hundreds of megabytes can be naturally compressed into an event stream of a few hundred kilobytes [36,52,63]. Additionally, the time resolution of event-based cameras is orders of magnitude higher than that of frame-based cameras, reaching up to hundreds of microseconds. Finally, thanks to their logarithmic sensitivity to illumination changes, event-based cameras also have a much larger dynamic range, exceeding 120dB [52]. These characteristics make event-based cameras particularly interesting for applications with strong constraints on latency (e.g. autonomous navigation), power consumption (e.g. UAV vision and IoT), or bandwidth (e.g. tracking and surveillance).
Figure 1: Pixels of an event-based camera asynchronously generate events as soon as a contrast change is detected in their field of view. As a consequence, the output of an event-based camera can be extremely sparse, with a time resolution of the order of microseconds. Because of the asynchronous nature of the data and the high resolution of the temporal component of the events, compared to the spatial one, standard Computer Vision methods cannot be directly applied. Top: An event-based camera (left) recording a natural scene (right). Bottom: Visualization of the event stream generated by a moving object. ON and OFF events (Sec. 2) are represented by yellow and cyan dots respectively. This figure, as most of the figures in this paper, is best seen in color.
* This work was supported in part by the EU H2020 ULPEC project (grant agreement number 732642).
However, due to the novelty of the field, the performance of event-based systems in real-world conditions is still inferior to their frame-based counterparts [28,66]. We argue that two main limiting factors of event-based algorithms are: 1. the limited amount of work on low-level feature representations and architectures for event-based object classification; 2. the lack of large event-based datasets acquired in real-world conditions. In this work, we make important steps towards the solution of both problems.
We introduce a new event-based scalable machine learning architecture, relying on a low-level operator called Local Memory Time Surface. A time surface is a spatiotemporal representation of activities around an event relying on the arrival time of events from neighboring pixels [30]. However, the direct use of this information is sensitive to noise and non-idealities of the sensors. By contrast, we emphasize the importance of using the information carried by past events to obtain a robust representation. Moreover, we show how to efficiently store and access this past information by defining a new architecture based on local memory units, where neighboring pixels share the same memory block. In this way, the Local Memory Time Surfaces can be efficiently combined into a higher-order representation, which we call Histograms of Averaged Time Surfaces.
This results in an event-based architecture which is significantly faster and more accurate than existing ones [30,33,46]. Driven by brain-like asynchronous event-based computations, this new architecture opens the perspective of a new class of machine learning algorithms that focus the computational effort only on active parts of the network.
Finally, motivated by the importance of large-scale datasets for the recent progress of Computer Vision systems [16,28,37], we also present a new real-world event-based dataset dedicated to car classification. This dataset is composed of about 24k samples acquired from a car driving in urban and motorway environments. These samples were annotated using a semi-automatic protocol, which we describe below. To the best of our knowledge this is the largest labeled event-based dataset acquired in real-world conditions.
We evaluate our method on our new event-based dataset and on four other challenging ones. We show that our method reaches higher classification rates and faster computation times than existing event-based algorithms.

Event-based camera
Conventional cameras encode the observed scene by producing dense information at a fixed frame-rate. As explained in Sec. 1, this is an inefficient way to encode natural scenes. Following this observation, a variety of event-based cameras [36,52,63] have been designed over the past few years, with the goal of encoding the observed scene adaptively, based on its content.
In this work, we consider the ATIS camera [52]. The ATIS camera contains an array of fully asynchronous pixels, each composed of an illuminance relative change detector and a conditional exposure measurement block. The relative change detector reacts to changes in the observed scene, producing information in the form of asynchronous address events [4], known henceforth as events. Whenever a pixel detects a change in illuminance intensity, it emits an event containing its x-y position in the pixel array, the microsecond timestamp of the observed change and its polarity: i.e. whether the illuminance intensity was increasing (ON events) or decreasing (OFF events). The conditional exposure measurement block measures the absolute luminous intensity observed by a pixel [49]. In the ATIS, the measurement itself is not triggered at fixed frame-rate, but only when a change in the observed scene is detected by the relative change detector.
In this work, the luminous intensity measures from the ATIS camera were used only to generate ground-truth annotations for the dataset presented in Sec. 5. By contrast, the object classification pipeline was designed to operate on change-events only, in order to support generic event-based cameras, whether or not they include the ATIS feature to generate gray levels. In this way, any event-based camera can be used to demonstrate the potential of our approach, while leaving the possibility for further improvement when gray-level information is available [39].

Related work
In this section, we first briefly review frame-based object classification, then we describe previous work on event-based features and object classification. Finally, we discuss existing event-based datasets.

Frame-based Features and Object Classification
There is a vast literature on spatial [40,13,67,57] and spatiotemporal [31,73,60] feature descriptors for frame-based Computer Vision. Early approaches mainly focus on handcrafting feature representations for a given problem by using domain knowledge. Well-designed features combined with shallow classifiers have driven research in object recognition for many decades [72,13,18] and helped in understanding and modeling important properties of the object classification problem, such as local geometric invariants, color and light properties, etc. [74,1].
In the last few years, the availability of large datasets [16,37] and effective learning algorithms [32,26,68] shifted the research direction towards data-driven learning of feature representations [2,22]. Typically this is done by optimizing the weights of several layers of elementary feature extraction operations, such as spatial convolutions, pixelwise transformations, pooling etc. This allowed an impressive improvement in the performance of image classification approaches and many other Computer Vision problems [28,66,75]. Deep Learning models, although less easily interpretable, also allowed understanding higher order geometrical properties of classical problems [8].
By contrast, the work on event-based Computer Vision is still in its early stages and it is unclear which feature representations and architectures are best suited for this problem. Finding adequate low-level feature operations is a fundamental topic both for understanding the properties of event-based problems and for finding the best architectures and learning algorithms to solve them.

Event-based Features and Object Classification
The majority of prior work on event-based features focused on detecting and tracking stable features adapted to Simultaneous Localization and Mapping applications [25,54]. Corner detectors have been defined in [11,71,43], while the works of [61,7] focused on edge and line extraction.
Recently, [12] introduced a feature descriptor based on local distributions of optical flow and applied it to corner detection and gesture recognition. It is inspired by its frame-based counterpart [10], but in [12] the algorithm for computing the optical flow relies on the temporal information carried by the events. One limitation of [12] is that the quality of the descriptor strongly depends on the quality of the flow. As a consequence, it loses accuracy in the presence of noise or poorly contrasted edges.
Event-based classification algorithms can be divided into two categories: unsupervised learning methods and supervised ones. Most unsupervised approaches train artificial neural networks by reproducing or imitating the learning rules observed in biological neural networks [21,42,38,3,65,41]. Supervised methods [47,29,34,51], similar to what is done in frame-based Computer Vision, try to optimize the weights of artificial networks by minimizing a smooth error function.
The most commonly used architectures for event-based cameras are Spiking Neural Networks (SNN) [5,59,24,17,9,76,46]. SNNs are a promising research field; however, their performance is limited by the discrete nature of the events, which makes it difficult to properly train an SNN with gradient descent. To avoid this, some authors [50] use predefined Gabor filters as weights in the network. Others propose to first train a conventional Convolutional Neural Network (CNN) and then to convert the weights to an SNN [9,58]. In both cases, the obtained solutions are suboptimal and typically the performance is lower than that of conventional CNNs on frames. Other methods consider a smoothed version of the transfer function of an SNN and directly optimize it [33,45,69]. The convergence of the corresponding optimization problem is still very difficult to obtain and typically only small networks with a few layers can be trained.
Recently, [30] proposed an interesting alternative to SNNs by introducing a hierarchical representation based on the definition of Time Surface. In [30], learning is unsupervised and performed by clustering time surfaces at each layer, while the last layer sends its output to a classifier. The main limitations of this method are its high latency, due to the increasing time window needed to compute the time surfaces and the high computational cost of the clustering algorithm.
We propose a much simpler yet effective feature representation. We generalize time surfaces by introducing a memory effect in the network by storing the information carried by past events. We then build our representation by applying a regularization scheme both in time and space to obtain a compact and fast representation. Although the architecture is scalable, we show that once a memory process is introduced, a single layer is sufficient to outperform a multilayer approach directly relying on time surfaces. This reduces computation, but more importantly, adds more generalization and robustness to the network.

Event-based Datasets
An issue of previous work on event-based object classification is that the proposed solutions are tested either on very small datasets [64,30], or on datasets generated by converting standard videos or images to an event-based representation [48,23,35]. In the first case, the small size of the test set prevents an accurate evaluation of the methods. In the second case, the dataset size is large enough to create a valid tool for testing new algorithms. However, since the datasets are generated from static images, the real dynamics of a scene and the temporal resolution of event-based cameras cannot be fully exploited, and there is no guarantee that a method tested on this kind of artificial data will behave similarly in real-world conditions.
The authors of [44] released an event-based dataset adapted to test visual odometry algorithms. Unfortunately, this dataset does not contain labeled information for an object recognition task.
The lack of large real-world datasets is a major slowing factor for event-based vision [70]. By releasing a new labeled real-world event-based dataset, and defining an efficient semi-automated protocol based on a single event-based camera, we intend to accelerate progress toward a robust and accurate event-based object classifier.
[Figure caption fragment: ... (Eq. (6)). Thanks to both the spatial and temporal regularization, the contribution of noise is almost completely suppressed.]

Method
In this section, we formalize the event-based representation of visual scenes and describe our event-based architecture for object classification.

Time Surfaces
Given an event-based sensor with pixel grid size $M \times N$, a stream of events is given by a sequence
$$\mathcal{E} = \{e_i\}_{i=1}^{I}, \quad \text{with } e_i = (\mathbf{x}_i, t_i, p_i), \tag{1}$$
where $\mathbf{x}_i = (x_i, y_i) \in [1, \dots, M] \times [1, \dots, N]$ are the coordinates of the pixel generating the event, $t_i \geq 0$ is the timestamp at which the event was generated, with $t_i \leq t_j$ for $i < j$, $p_i \in \{-1, 1\}$ is the polarity of the event, with $-1$ and $1$ meaning respectively OFF and ON events, and $I$ is the number of events. From now on we will refer to individual events by $e_i$ and to a sequence of events by $\{e_i\}$.
In [30], the concept of time surface is introduced to describe local spatio-temporal patterns around an event. A time surface can be formalized as a local spatial operator acting on an event $e_i$ by $\bar{\mathcal{T}}_{e_i}(\cdot, \cdot) : [-\rho, \rho]^2 \times \{-1, 1\} \to \mathbb{R}$, where $\rho$ is the radius of the spatial neighborhood used to compute the time surface.
For an event $e_i = (\mathbf{x}_i, t_i, p_i)$ and $(z, q) \in [-\rho, \rho]^2 \times \{-1, 1\}$, $\bar{\mathcal{T}}_{e_i}$ is given by
$$\bar{\mathcal{T}}_{e_i}(z, q) = \begin{cases} e^{-\frac{t_i - t'(\mathbf{x}_i + z, q)}{\tau}} & \text{if } p_i = q, \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$
where $t'(\mathbf{x}_i + z, q)$ is the time of the last event with polarity $q$ received from pixel $\mathbf{x}_i + z$ (Fig. 2(a)), and $\tau$ is a decay factor giving less weight to events further in the past. Intuitively, a time surface encodes the dynamic context in a neighborhood of an event, hence providing both temporal and spatial information. Therefore, this compact representation of the content of the scene can be useful to classify different patterns.
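To make the operator of Eq. (2) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes events are tuples `(x, y, t, p)` with timestamps in microseconds, that only strictly earlier events count as "last events", and that a pixel which has never fired contributes 0. The names `time_surface` and `run_time_surfaces` are hypothetical.

```python
import math

def time_surface(event, last_time, rho, tau):
    """Time surface of one event as a dict {(dx, dy, q): value}.

    last_time maps (x, y, polarity) to the timestamp of the most recent
    earlier event at that pixel with that polarity.
    """
    x, y, t, p = event
    surface = {}
    for dx in range(-rho, rho + 1):
        for dy in range(-rho, rho + 1):
            for q in (-1, 1):
                t_last = last_time.get((x + dx, y + dy, q))
                if q == p and t_last is not None:
                    # Exponential decay: older events weigh less.
                    surface[(dx, dy, q)] = math.exp(-(t - t_last) / tau)
                else:
                    # Assumption: unseen pixels and the opposite polarity give 0.
                    surface[(dx, dy, q)] = 0.0
    return surface

def run_time_surfaces(events, rho, tau):
    """Compute one time surface per event of a time-ordered stream."""
    last_time = {}
    surfaces = []
    for ev in events:
        surfaces.append(time_surface(ev, last_time, rho, tau))
        x, y, t, p = ev
        last_time[(x, y, p)] = t  # update only after the surface is computed
    return surfaces
```

For example, with `rho = 1` and `tau = 1000`, an ON event at pixel (6, 5) arriving 1000 µs after an ON event at (5, 5) sees the value `exp(-1)` at offset `(-1, 0)` and 0 elsewhere.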

Local Memory Time Surfaces
To build the feature representation, we start by generalizing the time surface $\bar{\mathcal{T}}_{e_i}$ of Eq. (2). As shown in Fig. 2(a), using only the time $t'(\mathbf{x}_i + z, q)$ of the last event received in the neighborhood of the time surface pixel $\mathbf{x}_i$ leads to a descriptor which is too sensitive to noise or small variations in the event stream.
To avoid this problem, we compute the time surface by considering the history of the events in a temporal window of size $\Delta t$. More precisely, we define a local memory time surface $\mathcal{T}_{e_i}$ as
$$\mathcal{T}_{e_i}(z, q) = \begin{cases} \sum_{e_j \in \mathcal{N}_{(z,q)}(e_i)} e^{-\frac{t_i - t_j}{\tau}} & \text{if } p_i = q, \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$
where
$$\mathcal{N}_{(z,q)}(e_i) = \{ e_j : \mathbf{x}_j = \mathbf{x}_i + z,\ t_j \in [t_i - \Delta t, t_i),\ p_j = q \}. \tag{4}$$
As shown in Fig. 2(b), this formulation more robustly describes the real dynamics of the scene while resisting noise and small variations of events. In the supplementary material we compare the results obtained by using Eq. (2) or Eq. (3) on an object classification task, showing the advantage of using the local memory formulation to achieve better accuracy. The name Local Memory Time Surfaces comes from the fact that past events $\{e_j\}$ in $\mathcal{N}_{(z,q)}(e_i)$ need to be stored in memory units in order to prevent the algorithm from 'forgetting' past information. In Sec. 4.4, we will describe how memory units can be shared efficiently by neighboring pixels. In this way, we can compute a robust feature representation without a significant increase in memory requirements.
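The difference with the last-event time surface can be illustrated with a short Python sketch of Eq. (3), given here under stated assumptions rather than as the authors' code: events are tuples `(x, y, t, p)`, the list of past events is supplied by the caller, and only the slice with the same polarity as the current event is returned since the other slice is zero. The function name is hypothetical.

```python
import math

def local_memory_time_surface(event, past_events, rho, tau, delta_t):
    """Local memory time surface of one event.

    Returns a (2*rho+1) x (2*rho+1) grid indexed as [dy + rho][dx + rho]
    for the polarity of `event`.
    """
    x, y, t, p = event
    side = 2 * rho + 1
    surface = [[0.0] * side for _ in range(side)]
    for xj, yj, tj, pj in past_events:
        dx, dy = xj - x, yj - y
        in_space = abs(dx) <= rho and abs(dy) <= rho
        in_time = t - delta_t <= tj < t  # half-open window [t - delta_t, t)
        if pj == p and in_space and in_time:
            # Every past event in the window contributes, not only the last one.
            surface[dy + rho][dx + rho] += math.exp(-(t - tj) / tau)
    return surface
```

Because all events in the window are summed, a single spurious event changes the value at a pixel only marginally, whereas in the last-event formulation it would overwrite it.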

Histograms of Averaged Time Surfaces
The local memory time surface of Eq. (3) is the elementary spatio-temporal operator we use in our approach. In this section, we describe how this new type of time surface can be used to define a compact representation of an event stream useful for object classification.
Inspired by [13] in frame-based vision, we group adjacent pixels in cells $\{C_l\}_{l=1}^{L}$ of size $K \times K$. Then, for each cell $C$, we sum the components of the time surfaces computed on events from $C$ into histograms. More precisely, for a cell $C$ we have:
$$\bar{h}_C(z, p) = \sum_{e_i \in C} \mathcal{T}_{e_i}(z, p), \tag{5}$$
where, with an abuse of notation, we write $e_i \in C$ if and only if the pixel coordinates $(x_i, y_i)$ of the event belong to $C$. A characteristic of event-based sensors is that the number of events generated by a moving object is proportional to its contrast: higher contrast objects generate more events than low contrast objects. To make the cell descriptor more invariant to contrast, we therefore normalize $\bar{h}_C$ by the number of events $|C|$ contained in the spatio-temporal window used to compute it. This results in the averaged histogram:
$$h_C(z, p) = \frac{1}{|C|} \bar{h}_C(z, p) = \frac{1}{|C|} \sum_{e_i \in C} \mathcal{T}_{e_i}(z, p). \tag{6}$$
An example of a cell histogram $h_C(z, p)$ is shown in Fig. 2(c). Given a stream of events, our final descriptor, which we call HATS for Histograms of Averaged Time Surfaces, is given by concatenating every $h_C$, for all positions $z$, polarities and cells $1, \dots, L$:
$$H = [h_{C_1}, \dots, h_{C_L}]. \tag{7}$$
Fig. 3(a) shows an overview of our method. Similarly to standard Computer Vision methods, we can further group adjacent cells into blocks and perform a block-normalization scheme to obtain more invariance to velocity and contrast [13]. In Sec. 6, we show how this simple representation obtains higher accuracy for event-based object classification compared to previous approaches.
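The per-cell summation, averaging and concatenation described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the reference implementation: pixel coordinates are 0-indexed, each time surface is passed as a flat list holding only the slice for its event's polarity, empty cells are left at zero, and the name `hats_descriptor` is hypothetical. Block normalization is omitted.

```python
def hats_descriptor(events, surfaces, width, height, K):
    """Average per-cell time surfaces and concatenate them.

    events   : list of (x, y, t, p) with p in {-1, +1}
    surfaces : list of flat lists, one per event, all of length n
    Returns a flat list with layout [cell][OFF slice, ON slice].
    """
    cells_x = -(-width // K)   # ceiling division
    cells_y = -(-height // K)
    n_cells = cells_x * cells_y
    n = len(surfaces[0]) if surfaces else 0
    hist = [[0.0] * (2 * n) for _ in range(n_cells)]
    counts = [0] * n_cells
    for (x, y, _t, p), surf in zip(events, surfaces):
        c = (y // K) * cells_x + (x // K)   # cell containing the event
        offset = 0 if p == -1 else n        # OFF slice first, then ON slice
        for k, v in enumerate(surf):
            hist[c][offset + k] += v        # summation over events in the cell
        counts[c] += 1
    descriptor = []
    for c in range(n_cells):
        norm = max(counts[c], 1)            # avoid dividing by zero for empty cells
        descriptor.extend(v / norm for v in hist[c])  # averaging by |C|
    return descriptor
```

Dividing by the event count is what makes two objects with the same shape but different contrast, and hence different event rates, map to similar descriptors.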

Architecture with Locally Shared Memory Units
Irregular access in event-based cameras is a well known limiting factor for designing efficient event-based algorithms. One of the main problems is that the use of standard hardware accelerations, such as GPU, is not trivial due to the sparse and asynchronous nature of the events. For example, accessing spatial neighbors on contiguous memory blocks can impose significant overheads when processing event-based data.
The architecture computing the HATS representation makes it possible to overcome this memory access issue (Fig. 3). From Eq. (5) we notice that for every incoming event $e_i$, we need to iterate over all events in a past spatio-temporal neighborhood. Since, for small values of $\rho$, most of the past events would not be in the neighborhood of $e_i$, looping through the entire temporally ordered event stream would be prohibitively expensive and inefficient. To avoid this, we notice that, for $\rho \approx K$, the events falling in the same cell $C$ will share most of the neighbors $\mathcal{N}_{(z,q)}$ used to compute Eq. (3). Following this observation, for every cell, we define a shared memory unit $\mathcal{M}_C$, where past events relevant for $C$ are stored. In this way, when a new event arrives in $C$, we update Eq. (5) by only looping through $\mathcal{M}_C$, which contains only the past events relevant to computing the Local Memory Time Surface of Eq. (3) (Fig. 3(b)).
Algorithm 1 describes the computation of HATS with memory units. Although this is beyond the scope of this paper, we notice that Algorithm 1 can be easily parallelized and implemented in dedicated neuromorphic chips [62].
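The idea of per-cell memory units can be sketched as follows in Python. This is a simplified illustration, not a transcription of Algorithm 1: it assumes time-ordered events `(x, y, t, p)` with 0-indexed pixels, stores each event only in the memory unit of its own cell (so neighbors lying in an adjacent cell are ignored), and discards events older than the temporal window. The function name is hypothetical.

```python
import math
from collections import deque

def hats_shared_memory(events, width, height, K, rho, tau, delta_t):
    """HATS descriptor computed with one memory unit per cell."""
    cells_x = -(-width // K)
    cells_y = -(-height // K)
    n_cells = cells_x * cells_y
    side = 2 * rho + 1
    n = side * side
    hist = [[0.0] * (2 * n) for _ in range(n_cells)]
    counts = [0] * n_cells
    memory = [deque() for _ in range(n_cells)]   # one memory unit per cell
    for x, y, t, p in events:
        c = (y // K) * cells_x + (x // K)
        mem = memory[c]
        while mem and mem[0][2] < t - delta_t:   # forget events outside the window
            mem.popleft()
        offset = 0 if p == -1 else n
        for xj, yj, tj, pj in mem:               # loop over the cell memory only
            dx, dy = xj - x, yj - y
            if pj == p and abs(dx) <= rho and abs(dy) <= rho and tj < t:
                idx = offset + (dy + rho) * side + (dx + rho)
                hist[c][idx] += math.exp(-(t - tj) / tau)
        mem.append((x, y, t, p))                 # store the new event
        counts[c] += 1
    # Average each cell histogram by its event count and concatenate.
    return [v / max(counts[c], 1) for c in range(n_cells) for v in hist[c]]
```

Each incoming event touches a single memory unit, so the cost per event depends on the number of recent events in its cell rather than on the length of the whole stream, and cells can be processed independently.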

Datasets
We validated our approach on five different datasets: four datasets generated by converting standard frame-based datasets to events (namely, the N-MNIST [48], N-Caltech101 [48], MNIST-DVS [63] and CIFAR10-DVS [35] datasets) and a novel dataset, recorded from real-world scenes and introduced for the first time in this paper, which we call N-CARS. We made the N-CARS dataset publicly available for download at http://www.prophesee.ai/dataset-n-cars/.
N-MNIST and N-Caltech101 were obtained by displaying each sample image on an LCD monitor, while an ATIS sensor (Section 2) was moving in front of it [48]. Similarly, the MNIST-DVS and CIFAR10-DVS datasets were created by displaying a moving image on a monitor and recorded with a fixed DVS sensor [63].
In both cases, the result is a conversion of the images of the original datasets into a stream of events suited for evaluating event-based object classification. Fig. 4(a,b) shows some representative examples of the datasets generated from frames, for the N-MNIST and N-Caltech101.

Dataset Acquired Directly as Events: N-CARS
The datasets described in the previous section are good datasets for a first evaluation of event-based classifiers. However, since they were generated by displaying images on a monitor, they are not very representative of data from real-world situations. The main shortcoming results from the limited and predefined motion of the objects.
To overcome these limitations, we created a new dataset by directly recording objects in urban environments with an event-based sensor. The dataset was obtained with the following semi-automatic protocol. First, we captured approximately 80 minutes of video using an ATIS camera (Section 2) mounted behind the windshield of a car. The driving was conducted in a natural way, without particular regards for video quality or content. In a second stage, we converted gray-scale measurements from the ATIS sensor to conventional gray-scale images. We then processed them with a state-of-the-art object detector [55,56], to automatically extract bounding boxes around cars and background samples. Finally, the data was manually cleaned to ensure that the samples were correctly labeled.
Since the gray-scale measurements have the same time resolution as the change detection events, the gray-level images can be easily synchronized with the change detection events. Thus, the positions and timestamps of the bounding boxes can be directly used to extract the corresponding event-based samples from the full event stream. Thanks to our semi-automated protocol, we generated a two-class dataset composed of 12,336 car samples and 11,693 non-car samples (background). The dataset was split into 7940 car and 7482 background training samples, and 4396 car and 4211 background testing samples. Each sample lasts 100 milliseconds. More details on the dataset can be found in the supplementary material.
We called this new dataset N-CARS. As shown in Fig. 4(c), N-CARS is a challenging dataset, containing cars at different poses, speeds and occlusions, as well as a large variety of background scenarios.
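The extraction step, in which a bounding box and its timestamp select a 100 ms sample from the full event stream, could look like the following Python sketch. The box format `(x0, y0, w, h)`, the microsecond timestamps, and the re-referencing of coordinates and times to the sample origin are assumptions made for illustration; the function name is hypothetical.

```python
def crop_events(events, box, t_start, duration_us=100_000):
    """Select the events inside a bounding box and a time window.

    events : iterable of (x, y, t, p), timestamps in microseconds
    box    : (x0, y0, w, h) in pixel coordinates
    Returns events shifted so that the box corner and t_start become the origin.
    """
    x0, y0, w, h = box
    sample = []
    for x, y, t, p in events:
        inside_box = x0 <= x < x0 + w and y0 <= y < y0 + h
        inside_window = t_start <= t < t_start + duration_us
        if inside_box and inside_window:
            sample.append((x - x0, y - y0, t - t_start, p))
    return sample
```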

Event-based Object Classification
Once the features have been extracted from the event sequences of the database, the problem reduces to a conventional classification problem. To highlight the contribution of our feature representation to classification accuracy, we used a simple linear SVM classifier in all our experiments. A more complex classifier, such as a non-linear SVM or a Convolutional Neural Network, could be used to further improve the results.
The parameters for all methods were optimized by splitting the training set and using 20% of the data for validation. Once the best settings were found, the classifier was [...] We noticed little influence of the ρ and τ parameters on accuracy, while small values of K improved performance for low resolution inputs. When the input duration is larger than the value of ∆t used to compute the time surfaces (Eq. 4), we compute the features every ∆t and then stack them together.
The baseline methods we consider are HOTS [30], H-First [50] and Spiking Neural Networks (SNN) [33,46]. For H-First we used the code provided by the authors online. For SNN we report the results previously published, when available, while for HOTS we used our implementation of the method described in [30]. As with HATS features, we used a linear SVM on the features extracted with HOTS. Notice that this is in favour of HOTS, since linear SVM is a more powerful classifier than the one used by the authors [30].
Given that no code is available for SNN, we also compared our results with those of a 2-layer SNN architecture we implemented using predefined Gabor filters [6]. We then again train a linear SVM on the output of the network. We call this approach Gabor-SNN. This allowed us to obtain the results for SNN when not readily available in the literature.

Results on the Datasets Converted from Frames
The results for the N-MNIST, N-Caltech101, MNIST-DVS and CIFAR10-DVS datasets are given in Tab. 1. As is usually done, we report the results in terms of classification accuracy. The complete set of parameters used for the methods is reported in the supplementary material.
Our method has the highest classification rate ever reported for an event-based classification method. The performance improvement is higher for the more challenging N-Caltech101 and CIFAR10-DVS datasets. HOTS and a predefined Gabor-SNN have similar performance, while the H-First learning mechanism is too simple to reach good performance.

Results on the N-CARS Dataset
For the N-CARS dataset, the HATS parameters used are $K = 10$, $\rho = 3$ and $\tau = 10^9$ µs. In this case, block normalization was not applied because it did not improve results. Since the N-CARS dataset contains only two classes, cars and non-cars, we can consider it as a binary classification problem. Therefore, we also analyze the performance of the methods using ROC curve analysis [19]. The Area Under the Curve (AUC) and the accuracy (Acc.) for our method and the baselines are shown in Tab. 2, while the ROC curves are presented in the supplementary material.
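For readers unfamiliar with the two metrics, the following Python sketch shows one standard way to compute them from classifier scores. It is a generic illustration, not the evaluation code used in the paper: labels are assumed to be 1 for cars and 0 for background, the AUC is computed through its rank interpretation (the probability that a random positive scores higher than a random negative, ties counting one half), and the function names are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.0):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum((s > threshold) == (l == 1) for s, l in zip(scores, labels))
    return correct / len(labels)
```

Unlike accuracy, the AUC does not depend on a particular decision threshold, which is why both are reported for a binary problem.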
From the results, we see that our method outperforms the baselines by a large margin. The variability contained in a real-world dataset, such as the N-CARS one, is too large for both the H-First and HOTS learning algorithms to converge to a good feature representation. A predefined Gabor-SNN architecture has better accuracy than H-First and HOTS, but still 11% lower than our method. The spatio-temporal regularization implemented in our method is more robust to the noise and variability contained in the dataset.

Latency and Computational Time
Latency is a crucial characteristic for many applications requiring fast reaction time. In this section, we compare HATS, HOTS and Gabor-SNN in terms of their computational time and latency on the N-CARS dataset. All methods are implemented in C++ and run on a laptop equipped with an Intel i7 CPU (64 bits, 2.7GHz) and 16GB of RAM.
Tab. 3 compares the average computational times to process a sample. The average computational time per sample was computed by dividing the total time spent computing the features on the full training set by the number of training samples. As we can see, our method is more than 20x faster than HOTS and almost 40x faster than a 2-layer SNN. In particular, our method is 13 times faster than real time. We also report the average number of events processed per second in kilo-events per second (Kev/s).
Latency represents the time period used to accumulate evidence in order to reach a decision on the object class. In our case, this time period is given by the time window used to compute the features, as longer time windows result in higher latency. Notice that with this definition, the latency is independent of both the computational time and the classification accuracy.
[Table fragment, remaining rows and column headers not recoverable: ... [46]: 0.973, -, -, -; Deep SNN [33]: 0.987, -, -, -]
There is a trade-off between latency and classification accuracy: on one side longer time periods yield more information at the cost of higher latency, on the other side they lead to risk of mixing dynamics from separate objects or even different dynamics from the same object. We study this trade-off by plotting the accuracy as a function of the latency for the different methods (Fig. 5). The results were averaged over 5 repetitions. By using only 10ms of events, HATS has higher performance than the baselines applied to the full 100ms events stream. The performance of HATS does not completely saturate, probably due to the presence of cars with really small apparent motion in the dataset.
We also notice that the performance of Gabor-SNN is unstable, especially for low latency. This is due to the spiking architecture of Gabor-SNN for which small variations in the input of a layer can cause large differences at its output.

Table 3: Average computational times per sample (the lower the better) and average number of events processed per second, in kilo-events per second, Kev/s (the higher the better), on the N-CARS dataset. Since each sample is 100ms long, our method is more than 13 times faster than real time, while HOTS and Gabor-SNN are respectively 1.5 and 2.8 times slower than real time.
Figure 5: Accuracy as a function of latency on the N-CARS dataset. Our method is consistently more accurate than the baselines and already reaches better performance by using only events contained in the first 10ms of the samples.

Conclusion and Future Work
In this work, we presented a new feature representation for event-based object recognition by introducing the notion of Histograms of Averaged Time Surfaces. It validates the idea that information is contained in the relative time between events, provided a regularization scheme is introduced to limit the effect of noise. The proposed architecture makes efficient use of past information by using local memory units shared by neighboring pixels, outperforming existing spike-based methods in both accuracy and efficiency.
In the future, we plan to extend our method by using a feature representation also for the memory units, instead of using raw events. This could be done for example by training a network to learn linear weights to apply to the incoming time surfaces.