Swarm-Steward Benchmark Data: Feature RAG, Swarm Scalability, and LLM Model Evaluation Artifacts

Jarabo-Peñas, Alejandro; Bravo Arrabal, Juan; Rolland, Edouard; Christensen, Anders Lyhne

doi:10.5281/zenodo.18715260

Published February 23, 2026 | Version v1

Dataset Open

1. University of Southern Denmark

Abstract

Swarm-Steward is a platform-agnostic system for natural-language swarm control that enables a non-expert operator to coordinate many drones by creating and commanding groups, bridging high-level intent to reliable, low-level execution. The system uses a hierarchical LLM-based multi-agent design where planning and context gathering are separated from actuation: a tool-less Coordinator decomposes each request into staged sub-tasks and delegates them to specialised agents — a Spatial agent and a History agent for non-actuating context gathering (map relations, telemetry), and a Swarm agent as the sole authority permitted to modify world state by dispatching deterministic, schema-constrained commands through a safety gate (geofencing, altitude and separation limits), with optional operator preview before execution. To keep grounding scalable, Swarm-Steward applies dual retrieval-augmented generation over both map features (Feature RAG) and telemetry variables (State RAG), injecting only relevant candidates at each step. The system supports both MQTT and ROS 2/DDS communication backends and has been validated in simulation as well as on real DJI Mini 4 Pro drones via WildBridge.
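The safety gate described above (geofencing, altitude and separation limits) can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `MoveCommand` and `SafetyLimits` types, the local metric coordinates, and the limit values are hypothetical and are not taken from the Swarm-Steward code base.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class MoveCommand:
    """Hypothetical schema-constrained command: a target in local metres."""
    drone_id: str
    x: float
    y: float
    altitude: float


@dataclass(frozen=True)
class SafetyLimits:
    """Illustrative limits; the real values are not given in this record."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    alt_min: float
    alt_max: float
    min_separation: float


def check_commands(commands, limits):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    for c in commands:
        inside = (limits.x_min <= c.x <= limits.x_max
                  and limits.y_min <= c.y <= limits.y_max)
        if not inside:
            violations.append(f"{c.drone_id}: target outside geofence")
        if not (limits.alt_min <= c.altitude <= limits.alt_max):
            violations.append(f"{c.drone_id}: altitude out of range")
    # Pairwise horizontal separation between the commanded targets.
    for i, a in enumerate(commands):
        for b in commands[i + 1:]:
            if hypot(a.x - b.x, a.y - b.y) < limits.min_separation:
                violations.append(f"{a.drone_id}/{b.drone_id}: targets too close")
    return violations


limits = SafetyLimits(0, 100, 0, 100, 5, 120, 3.0)
batch = [
    MoveCommand("d1", 10, 10, 30),
    MoveCommand("d2", 11, 10, 30),   # only 1 m away from d1
    MoveCommand("d3", 150, 20, 30),  # outside the geofence
]
report = check_commands(batch, limits)
```

A gate of this kind would sit between the Swarm agent and the transport layer, so a batch with a non-empty report could be rejected or shown to the operator before anything is dispatched.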

This dataset accompanies the paper "Swarm-Steward: Scalable and Reliable Natural-Language Coordination of Autonomous Aerial and Ground Robots", submitted to the International Conference on Unmanned Aircraft Systems (ICUAS 2026). It provides the complete experimental evidence for the three benchmarks reported in the paper (§V): Feature RAG retrieval quality (§V-A), swarm scalability from 5 to 500 drones (§V-B), and a 24-model LLM comparison (§V-C), together with demonstration images from simulation-to-real validation with DJI drones (§V-D).

Benchmark A — Feature RAG Retrieval Quality (§V-A)

Evaluates the Feature RAG module that grounds free-form geographic references by embedding map-feature descriptions and retrieving the top-k candidates most similar to the user query. Four embedding models are benchmarked: two cloud models (OpenAI text-embedding-3-small, 1,536-d; Google gemini-embedding-001, 3,072-d) and two local models (BAAI/bge-m3, 1,024-d; intfloat/e5-large-v2, 1,024-d). They are tested on three operational scenarios (Madrid, 356 features; Málaga, 108; Stockholm, 272), each augmented with in-domain synthetic noise (same-category near-duplicates) up to a pool of 10,000 features. Metrics: Hit Rate at k=10 (HR@10) and Mean Reciprocal Rank (MRR), computed over four query difficulty levels (exact name, partial name, category, descriptive) with three repetitions per condition.
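The retrieval step described above can be sketched as a top-k search by cosine similarity. This is a minimal illustration, assuming cosine similarity as the ranking function (the record does not state which similarity measure is used) and using made-up three-dimensional vectors and feature names in place of real embeddings.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, feature_vecs, k=10):
    """Return the names of the k features most similar to the query."""
    scored = sorted(
        feature_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical toy embeddings; real models produce 1,024 to 3,072 dimensions.
features = {
    "park_a": [0.9, 0.1, 0.0],
    "station_b": [0.1, 0.9, 0.1],
    "bridge_c": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
candidates = top_k(query, features, k=2)
```

Only the returned candidate names would be injected into the prompt, which is why the size of the feature pool does not change the amount of context the LLM sees.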

Principal finding: BGE-M3 (1,024-d, local inference on a commodity laptop GPU with 4 GB VRAM, ~22 ms latency) achieves the highest mean HR@10 of 87% and MRR of 0.79 at the 10,000-feature pool; higher-dimensional cloud models do not confer an advantage. Category queries reach 100% HR@10 across all models; partial-name queries are the primary discriminator. Search latency stays below 21 ms at all pool sizes, and only a fixed top-k set enters prompts, so scaling the map does not inflate LLM context.
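For readers who want to recompute the two metrics from query-level results, the following sketch uses the common textbook definitions of HR@k and MRR. It is an assumption that the dataset follows exactly these definitions (for example, whether the reciprocal rank is truncated at k=10 is not stated here), and the toy `results` list is invented.

```python
def hit_rate_at_k(results, k=10):
    """Fraction of queries whose target is among the first k retrieved items."""
    hits = sum(1 for target, retrieved in results if target in retrieved[:k])
    return hits / len(results)


def mean_reciprocal_rank(results):
    """Mean of 1/rank of the target; a query with no match contributes 0."""
    total = 0.0
    for target, retrieved in results:
        if target in retrieved:
            total += 1.0 / (retrieved.index(target) + 1)
    return total / len(results)


# Each entry: (expected feature, ranked list returned by the retriever).
results = [
    ("a", ["a", "b", "c"]),        # target at rank 1
    ("b", ["c", "b", "a"]),        # target at rank 2
    ("d", ["a", "b", "c"]),        # target never retrieved
    ("c", ["a", "b", "d", "c"]),   # target at rank 4
]
hr_at_3 = hit_rate_at_k(results, k=3)
mrr = mean_reciprocal_rank(results)
```

With this toy input, two of four targets fall in the top three, and the reciprocal ranks are 1, 1/2, 0 and 1/4.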

Data files:

rag_results.csv (576 rows) — Aggregate metrics per experimental condition (model × scenario × pool size × query type × repetition)
rag_queries.csv (8,112 rows) — Individual query-level results with retrieved feature names and latencies

Figures:

rag_scalability.pdf — HR@10 vs. pool size (2×2 grid by query type), one line per model
rag_model_heatmap.pdf — Model × query-type heatmap at the 10K-feature pool
rag_per_scenario.pdf — Per-scenario 3×4 breakdown (scenario × query type)
rag_noise_comparison.pdf — Generic vs. in-domain noise robustness comparison (3×4 grid)

Benchmark B — Swarm and Group Scalability (§V-B)

Evaluates how the multi-agent pipeline scales with fleet size across two orders of magnitude: 5, 10, 25, 50, 100, 200, and 500 simulated drones. Each configuration is executed 4 times (28 sessions, 252 prompts total). Each session executes a fixed sequence of 9 natural-language prompts covering group takeoff + move, dynamic grouping, formation traversals, feature orbits, area coverage, battery-conditional queries (History agent + State RAG), speed-based drone selection, and coordinated return-to-home — exercising all four agent types in the pipeline. The MQTT-based simulator backend is used to isolate LLM orchestration performance from network and DDS transport variability. The LLM backbone is Llama-3.3-70B (4-bit quantised, cloud-hosted via the Groq API). The orchestration pipeline, Docker containers, and MQTT simulator ran on a desktop PC with an AMD Ryzen 9 5950X (16 cores), 32 GB DDR4 RAM, and an NVIDIA RTX 4090 GPU, under Ubuntu 22.04.

Principal finding: total token consumption per session remains effectively constant at 546k ± 12% regardless of fleet size, confirming O(1) cost complexity enabled by group-level abstraction — the Swarm agent reasons over k ∈ [2, 40] groups rather than individual drones. 245 of 252 prompts succeed (97.2% overall). All 7 failures are concentrated at fleet sizes N ≥ 100 and share the same signature: the Coordinator produces a correct plan, but the Swarm agent completes zero LLM calls despite being invoked, pointing to external LLM API errors — most likely rate limiting under sustained benchmark load, with potential context-window overruns at N = 500 where Swarm agent input tokens can reach ~137k (exceeding the 128k context window). No failures stem from incorrect LLM reasoning, planning, or tool generation. Mean LLM pipeline latency is 6.8 s across all sizes.
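The counts quoted above follow from simple arithmetic, reproduced here as a quick check. The per-drone figures treat the reported mean of 546k tokens as a nominal constant and ignore the ± 12% spread, so they are indicative only.

```python
fleet_sizes = [5, 10, 25, 50, 100, 200, 500]
runs_per_size = 4
prompts_per_session = 9

sessions = len(fleet_sizes) * runs_per_size      # 7 sizes x 4 runs
prompts = sessions * prompts_per_session         # 9 prompts per session
success_rate = 245 / prompts                     # 245 successful prompts

# If session cost is flat, the cost attributable to each drone falls as 1/N.
mean_session_tokens = 546_000
tokens_per_drone = {n: mean_session_tokens / n for n in fleet_sizes}

for n in fleet_sizes:
    print(f"N={n:>3}: ~{tokens_per_drone[n]:,.0f} tokens per drone")
print(f"{sessions} sessions, {prompts} prompts, {success_rate:.1%} success")
```

A flat per-session total with a 1/N per-drone share is what the Tokens_per_drone.pdf figure is described as showing.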

Data files:

sessions.csv (28 rows) — Session-level summaries (total tokens, success rate, latency, payload size)
prompts.csv (252 rows) — Per-prompt measurements (latency breakdown, token counts, action outcomes)
agent_executions.csv (777 rows) — Per-agent-per-prompt breakdowns (LLM calls, tool invocations, token attribution)
drone_scalability_mqtt.json (21 MB) — Raw benchmark trace with full LLM conversations, tool calls, and timing data for all 28 sessions

Figures:

Tokens_per_drone.pdf — Per-drone token cost reduction across fleet sizes
Payload.pdf — State payload growth vs. fleet size
LLM_latency.pdf — LLM pipeline latency distribution
Success_Heatmap.pdf — Prompt × fleet-size success rate heatmap
Tool_Usage.pdf — Tool invocation frequency by agent type
Agent_Breakdown.pdf — Token distribution by agent type
Action_Exec_Times.pdf — Action execution time distributions
Session_Time_Actions.pdf — Session wall-clock time vs. actions
LLM_tool_calls.pdf — LLM tool calls per prompt
Dual_axis_scalability.pdf — Dual-axis scalability summary

Benchmark C — Model Benchmark: Planning vs. Tool-Calling Performance (§V-C)

Benchmarks 24 LLMs on a structured multi-agent control session that combines direct swarm commands with multi-step requests requiring contextual reasoning: spatial grounding, state inspection, and interaction history must be integrated before target selection and action dispatch. The evaluated tasks span group management, feature-referenced motion primitives, inspection queries, and recovery operations. Two complementary scores are measured: Plan (high-level coordination accuracy, out of 12) and Sub (strict sub-agent tool-calling reliability, out of 16), together with mean per-prompt latency. Models evaluated include cloud frontier models, open-weight models served via Groq, and a locally deployed 4-bit quantised model (GPT-OSS-20B on an NVIDIA RTX 5090, 24 GB VRAM).

Principal finding: GPT-4o achieves the highest overall scores (Plan 11/12, Sub 16/16, 16.4 s mean latency). Among open-weight models, Llama-3.3-70B (4-bit, cloud-hosted via Groq) delivers the best efficiency–performance trade-off (10/12, 14/16) at 6.7 s — the lowest latency of any high-performing model. For fully local deployment, GPT-OSS-20B (4-bit, laptop GPU) demonstrates the feasibility of offline operation (10/12, 13/16, 19.2 s). In this benchmark, execution robustness correlated more strongly with tool-schema compliance and instruction-following behaviour than with model size or parameter count.
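The three results quoted above can be placed side by side with a few lines of code. The scores and latencies are the ones reported in this record; the equal-weight `combined_fraction` aggregate is a hypothetical convenience for sorting and is not a metric defined by the paper.

```python
PLAN_MAX, SUB_MAX = 12, 16

# Values as reported in this record for three of the 24 models.
reported = {
    "GPT-4o": {"plan": 11, "sub": 16, "latency_s": 16.4},
    "Llama-3.3-70B": {"plan": 10, "sub": 14, "latency_s": 6.7},
    "GPT-OSS-20B": {"plan": 10, "sub": 13, "latency_s": 19.2},
}


def combined_fraction(scores):
    """Hypothetical equal-weight mean of the normalised Plan and Sub scores."""
    return (scores["plan"] / PLAN_MAX + scores["sub"] / SUB_MAX) / 2


ranking = sorted(reported, key=lambda m: combined_fraction(reported[m]), reverse=True)
fastest = min(reported, key=lambda m: reported[m]["latency_s"])

for name in ranking:
    s = reported[name]
    print(f"{name:<14} plan {s['plan']}/{PLAN_MAX}  sub {s['sub']}/{SUB_MAX}  "
          f"{s['latency_s']:.1f} s  combined {combined_fraction(s):.3f}")
```

Any other weighting of Plan against Sub could be substituted; the full per-model figures are in the two HTML reports listed under the data files.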

Data files:

batch_results_23_cloud_models.html (17 MB) — Interactive report for 23 cloud/Groq models with per-prompt drill-down, execution traces, latency, token usage, and cost (open in any browser, no server required)
batch_results_1_local_model.html (844 KB) — Interactive report for the local GPU model

Demonstration Images

Images from the simulation-to-real validation (§V-D), in which a seven-step operator script was executed through the same natural-language interface used in simulation on five DJI Mini 4 Pro drones at the SDU cricket field, via the WildBridge ROS 2 adapter.

GUI_example.png — Swarm-Steward web interface showing the operator chat, live map, and telemetry panels during a multi-drone session
real-scenario - formation.png — Real drone formation flight during field validation
real-scenario - take off.jpg — Coordinated takeoff of a 5× DJI Mini 4 Pro swarm
DJI Pro 4.jpeg — DJI Mini 4 Pro drone used for real-world tests

File Formats

CSV: UTF-8, comma-delimited, double-quote escaping. Multi-value fields use semicolons (;) as separators.
JSON: UTF-8 encoded raw benchmark output with full execution traces.
HTML: Self-contained interactive reports (open in any browser, no server required).
PDF: Vector figures with IEEE-compatible styling and colour-blind safe palettes.
Timestamps: ISO 8601 with timezone.
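The CSV conventions listed above map directly onto Python's standard library. The sketch below parses an in-memory sample rather than a real file, and its column names (`session_id`, `started_at`, `tools_used`, `note`) are hypothetical placeholders, since the actual headers are not listed in this record.

```python
import csv
import io
from datetime import datetime

# Invented sample row using the stated conventions: comma-delimited,
# double-quote escaping, semicolon-separated multi-value field, ISO 8601 time.
sample = io.StringIO(
    'session_id,started_at,tools_used,note\n'
    's1,2026-01-15T10:30:00+01:00,takeoff;move_group,"said ""go"", then moved"\n'
)

rows = []
for row in csv.DictReader(sample):
    # fromisoformat accepts offsets such as +01:00; a trailing "Z" is only
    # accepted from Python 3.11 onwards.
    row["started_at"] = datetime.fromisoformat(row["started_at"])
    # Split the multi-value field, mapping an empty cell to an empty list.
    row["tools_used"] = row["tools_used"].split(";") if row["tools_used"] else []
    rows.append(row)
```

For the real files, opening with `open(path, encoding="utf-8", newline="")` and passing the handle to `csv.DictReader` should behave the same way.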

Files (45.8 MB)

Name	MD5	Size
Action_Exec_Times.pdf	2e2bed6a0614a08f93ca59493727d119	22.5 kB
Agent_Breakdown.pdf	f72e172b795be03e6613488423f8b871	26.9 kB
agent_executions.csv	b0284734fef55ac02660c864fdc6b767	128.6 kB
batch_results_1_local_model.html	cc37fdf313dc82f05b65b2c8ef5ddfa3	864.2 kB
batch_results_23_cloud_models.html	7d3f888452b4f02821b1853a3fe7c20d	17.6 MB
DJI Pro 4.jpeg	0772141c515ae1a96adaeb92f9d58731	325.7 kB
drone_scalability_mqtt.json	7563ce408e593f6f4947ca6743ceaf5e	21.3 MB
Dual_axis_scalability.pdf	e90d3278a3306be7e27823b0f2f145bb	25.1 kB
GUI_example.png	42f0b7d7144d77c8d95690999571945a	2.6 MB
LLM_latency.pdf	0240e1cf4227333db89e56fe0392ce27	16.3 kB
LLM_tool_calls.pdf	0a0675faaedabbb67571a0301ab70e9e	17.7 kB
Payload.pdf	871641e50a75c3c32f4358db6d680bd0	17.7 kB
prompts.csv	c6de7a28279158b7b12b15dbeb0ad3a9	96.6 kB
rag_model_heatmap.pdf	5f5e0f1d9249b68618b42ae37227756b	43.0 kB
rag_noise_comparison.pdf	c02bccff904d9e993151edd157d42f5a	46.0 kB
rag_per_scenario.pdf	ef4c73690a6951d10b1fa4d0795dba09	47.9 kB
rag_queries.csv	3cab1fc04eb981a680a2cd9bb4c0f7a3	1.9 MB
rag_results.csv	3c9e11f5f49d546c4aca8a9732b952b9	92.8 kB
rag_scalability.pdf	6ecf425178537f1203d772c4d061f535	37.5 kB
real-scenario - formation.png	0c46ef4c161c502498ba2dc349818735	289.9 kB
real-scenario - take off.jpg	9e03929282bd3d62dfc335f4158cac48	313.2 kB
Session_Time_Actions.pdf	95a2310b1d0e18343fbaeca01738f8c2	18.1 kB
sessions.csv	005d0108b460fe625098e34bf9d0abf3	4.4 kB
Success_Heatmap.pdf	4c072a372af6c3eb953ba00e171919d7	28.2 kB
Tokens_per_drone.pdf	323c504ec220b1cb835d5c72275fdf6b	23.4 kB
Tool_Usage.pdf	7431470db12774b049fc4dbe5110eceb	17.7 kB
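
The MD5 checksums in the file listing can be used to confirm that downloads are intact. The following is a minimal sketch using `hashlib`; the three checksums are copied from the listing, while the helper names `md5_of` and `verify` and the `downloads` directory are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing; extend with further files as needed.
EXPECTED_MD5 = {
    "sessions.csv": "005d0108b460fe625098e34bf9d0abf3",
    "prompts.csv": "c6de7a28279158b7b12b15dbeb0ad3a9",
    "rag_results.csv": "3c9e11f5f49d546c4aca8a9732b952b9",
}


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each name to True (match), False (mismatch) or None (missing)."""
    status = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        status[name] = md5_of(path) == checksum if path.exists() else None
    return status


if __name__ == "__main__":
    # "downloads" is a placeholder for wherever the files were saved.
    for name, ok in verify("downloads").items():
        print(f"{name}: {'missing' if ok is None else 'ok' if ok else 'MISMATCH'}")
```

Chunked reading matters mainly for the larger files such as drone_scalability_mqtt.json (21.3 MB).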

Additional details

Funding

Funder	Project	Grant number
Danmarks Frie Forskningsfond	The NAMUR project	10.46540/4264-00105B
Innovation Fund Denmark	Innovation Fund Denmark (DIREC U07 – PERSIST)	DIREC U07
The Maria Sklodowska-Curie National Research Institute of Oncology	EU Horizon Europe WildDrone Project	101071224
