Published May 29, 2026 | Version v1
Preprint Open

Enhanced Multi-Modal UAV Perception using Large Language Models for Autonomous Disaster Reconnaissance

  • 1. Vellore Institute of Technology University

Description

This paper presents a multi-modal UAV perception framework that integrates LiDAR and Optical Flow Fusion (LOFF) odometry, YOLOv8-based semantic perception, and a narration layer powered by a Large Language Model (LLM) for autonomous disaster reconnaissance in simulation-oriented environments.

The proposed framework combines geometric localization, semantic scene understanding, and contextual reasoning to enable intelligent navigation in GPS-denied areas. LOFF combines LiDAR point-cloud alignment with optical flow estimation through Factor Graph Optimization (FGO) to achieve robust, low-drift pose estimation, while YOLOv8 performs real-time object detection and semantic scene analysis.
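The factor-graph idea behind this kind of fusion can be illustrated with a deliberately small sketch. It is not the paper's LOFF implementation: it assumes a 1D trajectory, linear relative-motion factors, and hypothetical noise levels, and it solves the resulting weighted least-squares problem with a plain Gaussian elimination. The function names `fuse_odometry_1d` and `solve_linear` are illustrative only. A real system would optimize full 3D poses, typically with a dedicated factor-graph library.

```python
def solve_linear(H, g):
    """Solve H x = g by Gaussian elimination with partial pivoting."""
    n = len(g)
    A = [row[:] + [rhs] for row, rhs in zip(H, g)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (A[r][n] - tail) / A[r][r]
    return x


def fuse_odometry_1d(lidar_steps, flow_steps, sigma_lidar, sigma_flow,
                     sigma_prior=1e-3):
    """Fuse two relative-motion sources over a 1D chain of poses.

    Each step contributes two factors with residual
    (x[i+1] - x[i] - z) / sigma, one per sensor. A prior factor anchors
    the first pose at 0. All sigmas are assumed values for illustration.
    """
    if len(lidar_steps) != len(flow_steps):
        raise ValueError("both sources must provide one step per interval")
    n = len(lidar_steps) + 1  # number of poses in the chain

    # Normal equations H x = g of the weighted least-squares problem.
    H = [[0.0] * n for _ in range(n)]
    g = [0.0] * n

    # Prior factor on the first pose.
    H[0][0] += 1.0 / sigma_prior ** 2

    for i in range(n - 1):
        for z, sigma in ((lidar_steps[i], sigma_lidar),
                         (flow_steps[i], sigma_flow)):
            w = 1.0 / sigma ** 2  # information (inverse variance)
            H[i][i] += w
            H[i + 1][i + 1] += w
            H[i][i + 1] -= w
            H[i + 1][i] -= w
            g[i] -= w * z
            g[i + 1] += w * z

    return solve_linear(H, g)


if __name__ == "__main__":
    poses = fuse_odometry_1d([1.0, 1.0], [1.2, 0.8],
                             sigma_lidar=0.1, sigma_flow=0.2)
    print([round(p, 3) for p in poses])
```

In this toy setting each fused step is the inverse-variance weighted mean of the two measurements, so the source assumed to be less noisy pulls the estimate toward its own reading. With the example inputs, the first fused step is 1.04 m, closer to the LiDAR value of 1.0 m than to the optical-flow value of 1.2 m.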

To improve interpretability, the framework includes an LLM-based narration module that converts structured UAV perception outputs into human-readable situational intelligence, including hazard alerts and navigation recommendations.
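One plausible way to hand structured perception outputs to a narration model is to serialize them into a prompt. The sketch below is an assumption about how such a step might look, not the paper's module: the function name `build_narration_prompt`, the field names (`label`, `confidence`, `range_m`), the confidence threshold, and the prompt wording are all hypothetical, and no LLM is actually called.

```python
import json


def build_narration_prompt(pose, detections, min_confidence=0.5):
    """Build a text prompt from a pose estimate and a list of detections.

    `pose` and `detections` use hypothetical field names chosen for this
    sketch. Detections below `min_confidence` are dropped so that the
    narration is not driven by weak evidence.
    """
    kept = [d for d in detections if d["confidence"] >= min_confidence]
    payload = {"pose_m": pose, "detections": kept}
    return (
        "You are assisting the operator of a disaster-reconnaissance UAV.\n"
        "Using only the JSON data below, list any hazards and give one "
        "navigation recommendation.\n"
        + json.dumps(payload, sort_keys=True)
    )


if __name__ == "__main__":
    example_detections = [
        {"label": "person", "confidence": 0.93, "range_m": 12.4},
        {"label": "debris", "confidence": 0.31, "range_m": 5.0},
    ]
    print(build_narration_prompt({"x": 1.0, "y": 2.0, "z": 10.0},
                                 example_detections))
```

Restricting the model to the supplied JSON is a design choice intended to keep the narration grounded in what the perception stack actually reported. The returned string would then be passed to whichever LLM interface the system uses.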

The framework is implemented using ROS, Gazebo, and ArduPilot SITL to facilitate synchronized multi-modal sensor integration and realistic UAV simulation. Experimental evaluations show localization drift below 0.15 m over a 100 m flight path, YOLOv8 person detection accuracy of 95.4% with 92.8% recall at an average inference time of 37 ms per frame, and obstacle detection precision exceeding 96%. These results validate the feasibility of integrating contextual language-based reasoning within perception-driven UAV systems.
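The reported figures can be restated in two common derived forms: drift as a percentage of distance travelled, and throughput in frames per second. The short sketch below only performs that arithmetic on the numbers quoted above. The helper names are illustrative, and because the drift is reported as an upper bound, the derived percentage is an upper bound as well.

```python
def drift_percent(drift_m, path_length_m):
    """Return drift as a percentage of the distance travelled."""
    if path_length_m <= 0:
        raise ValueError("path length must be positive")
    return 100.0 * drift_m / path_length_m


def frames_per_second(latency_ms):
    """Return the frame rate implied by a per-frame latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Values taken from the abstract: < 0.15 m over 100 m, 37 ms per frame.
    print(f"drift bound: {drift_percent(0.15, 100.0):.2f}% of path length")
    print(f"detector throughput: {frames_per_second(37.0):.1f} frames/s")
```

Under these figures, drift stays below 0.15% of the path length, and a 37 ms average inference time corresponds to about 27 frames per second for the detector alone, excluding any other processing in the pipeline.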

Files

Enhanced_Multi_Modal_UAV_Perception_using_Large_Language_Models_for_Autonomous_Disaster_Reconnaissance.pdf