Fusing Events and Frames with Self-Attention Network for Ball Collision Detection
Description
Ensuring robust and real-time obstacle avoidance is critical for the safe operation of autonomous robots in dynamic, real-world environments.
This paper proposes a neural network framework for predicting the position and time of a collision between an unmanned aerial vehicle and a dynamic object, using only RGB and event-based vision sensors.
The proposed architecture consists of two separate encoder branches, one per modality, whose features are fused with a self-attention module to improve prediction accuracy.
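For illustration only, a minimal sketch of such a two-branch, self-attention fusion model is shown below; the description does not specify the actual backbones, dimensions, or fusion details, so every module name, channel count, and token size in the snippet is an assumption rather than the paper's implementation.

```python
# Minimal sketch (assumed names/dimensions): two modality encoders whose token
# sequences are concatenated and fused with multi-head self-attention, then
# regressed to a collision position (x, y, z) and a time-to-collision.
import torch
import torch.nn as nn

class FusionCollisionPredictor(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_tokens=16):
        super().__init__()
        # Placeholder per-modality encoders; the paper's actual backbones may differ.
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, n_tokens * d_model),
        )
        self.event_encoder = nn.Sequential(
            nn.Conv2d(2, 32, 5, stride=2), nn.ReLU(),  # 2 polarity channels assumed
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, n_tokens * d_model),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 4)  # (x, y, z) collision position + time
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, rgb, events):
        b = rgb.shape[0]
        rgb_tok = self.rgb_encoder(rgb).view(b, self.n_tokens, self.d_model)
        evt_tok = self.event_encoder(events).view(b, self.n_tokens, self.d_model)
        tokens = torch.cat([rgb_tok, evt_tok], dim=1)  # concatenate both modalities
        fused, _ = self.attn(tokens, tokens, tokens)   # self-attention fusion
        return self.head(fused.mean(dim=1))            # pooled joint prediction

# Example: one RGB frame and one 2-channel event representation at 128x128.
model = FusionCollisionPredictor()
out = model(torch.randn(1, 3, 128, 128), torch.randn(1, 2, 128, 128))
print(out.shape)  # torch.Size([1, 4])
```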
To facilitate benchmarking, we introduce a multi-modal dataset that enables detailed comparisons of single-modality and fusion-based approaches.
At the same prediction throughput of 50 Hz, the experimental results show that the fusion-based model improves prediction accuracy over the single-modality approaches by 1% on average and by 10% for distances beyond 0.5 m, but at the cost of +71% memory and +105% FLOPs. Notably, the event-based model outperforms the RGB model by 4% in position error and 26% in time error at a similar computational cost, making it a competitive alternative.
Additionally, we evaluate quantized versions of the event-based models, applying 1- to 8-bit quantization to assess the trade-offs between predictive performance and computational efficiency.
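As a purely illustrative sketch (the description does not detail the quantization scheme used in the paper), uniform symmetric fake-quantization of weights at a chosen bit width could look as follows; the function name and the special handling of the 1-bit case are assumptions.

```python
# Minimal sketch of uniform symmetric weight (fake) quantization at an
# arbitrary bit width; the paper's actual quantization method and deployment
# target may differ.
import torch

def quantize_symmetric(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Quantize a weight tensor to `bits` bits and dequantize back to float."""
    if bits == 1:
        # One common binary choice: sign of the weights scaled by their mean magnitude.
        return w.sign() * w.abs().mean()
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8-bit
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

# Example: simulate 4-bit weights for every parameter of a model.
# for p in model.parameters():
#     p.data = quantize_symmetric(p.data, bits=4)
```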
These findings highlight the potential of multi-modal perception using RGB and event-based cameras in robotic applications.
Files
Name | Size
---|---
models.zip (md5:b3ed4d48b23fb5ae9a3c46c733a15ac5) | 132.0 MB
Additional details
Additional titles
- Alternative title: Towards Low-Latency Event-based Obstacle Avoidance on a FPGA-Drone

Dates
- Accepted: 2025-06