There is a newer version of the record available.

Published February 23, 2026 | Version v1
Dataset Open

Swarm-Steward Benchmark Data: Feature RAG, Swarm Scalability, and LLM Model Evaluation Artifacts

Description

Abstract

Swarm-Steward is a platform-agnostic system for natural-language swarm control that enables a non-expert operator to coordinate many drones by creating and commanding groups, bridging high-level intent to reliable, low-level execution. The system uses a hierarchical LLM-based multi-agent design in which planning and context gathering are separated from actuation. A tool-less Coordinator decomposes each request into staged sub-tasks and delegates them to specialised agents: a Spatial agent and a History agent for non-actuating context gathering (map relations, telemetry), and a Swarm agent as the sole authority permitted to modify world state. The Swarm agent dispatches deterministic, schema-constrained commands through a safety gate (geofencing, altitude and separation limits), with optional operator preview before execution. To keep grounding scalable, Swarm-Steward applies dual retrieval-augmented generation over both map features (Feature RAG) and telemetry variables (State RAG), injecting only relevant candidates at each step. The system supports both MQTT and ROS 2/DDS communication backends and has been validated in simulation as well as on real DJI Mini 4 Pro drones via WildBridge.
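
The safety gate mentioned in the abstract can be pictured as a pure validation step between the Swarm agent and the transport layer. The following is a minimal sketch under stated assumptions, not the Swarm-Steward implementation: the `MoveCommand` fields, the rectangular geofence, and every numeric limit are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from math import dist

# Hypothetical limits; the real system's values are not given in this record.
GEOFENCE = (0.0, 0.0, 100.0, 100.0)  # x_min, y_min, x_max, y_max in metres
MAX_ALTITUDE_M = 120.0
MIN_SEPARATION_M = 2.0


@dataclass(frozen=True)
class MoveCommand:
    """Illustrative schema-constrained command: one target per drone."""
    drone_id: str
    x: float
    y: float
    altitude: float


def check_commands(commands):
    """Return human-readable violations; an empty list means the batch may pass."""
    violations = []
    x_min, y_min, x_max, y_max = GEOFENCE
    for cmd in commands:
        if not (x_min <= cmd.x <= x_max and y_min <= cmd.y <= y_max):
            violations.append(f"{cmd.drone_id}: target outside geofence")
        if not (0.0 < cmd.altitude <= MAX_ALTITUDE_M):
            violations.append(f"{cmd.drone_id}: altitude out of range")
    # Pairwise separation check on the commanded 3-D targets.
    for i, a in enumerate(commands):
        for b in commands[i + 1:]:
            gap = dist((a.x, a.y, a.altitude), (b.x, b.y, b.altitude))
            if gap < MIN_SEPARATION_M:
                violations.append(f"{a.drone_id}/{b.drone_id}: targets too close")
    return violations
```

Returning a list of violations rather than raising on the first one is a design choice of this sketch: it lets an operator preview show every problem in a batch at once.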

This dataset accompanies the paper "Swarm-Steward: Scalable and Reliable Natural-Language Coordination of Autonomous Aerial and Ground Robots", submitted to the International Conference on Unmanned Aircraft Systems (ICUAS 2026). It provides the complete experimental evidence for the three benchmarks reported in the paper (§V): Feature RAG retrieval quality (§V-A), swarm scalability from 5 to 500 drones (§V-B), and a 24-model LLM comparison (§V-C), together with demonstration images from simulation-to-real validation with DJI drones (§V-D).

Contents

The deposit contains 5 structured CSV files, 1 raw JSON trace (21 MB), 2 interactive HTML reports, 14 publication-quality figures (PDF), and 4 demonstration images.

Benchmark A — Feature RAG Retrieval Quality (§V-A)

Evaluates the Feature RAG module that grounds free-form geographic references by embedding map-feature descriptions and retrieving the top-k candidates most similar to the user query. Four embedding models — two cloud (OpenAI text-embedding-3-small, 1,536-d; Google gemini-embedding-001, 3,072-d) and two local (BAAI/bge-m3, 1,024-d; intfloat/e5-large-v2, 1,024-d) — are benchmarked on three operational scenarios (Madrid, 356 features; Málaga, 108; Stockholm, 272) augmented with in-domain synthetic noise (same-category near-duplicates) up to 10,000 features. Metrics: Hit Rate at k=10 (HR@10) and Mean Reciprocal Rank (MRR), computed over four query difficulty levels (exact name, partial name, category, descriptive) with three repetitions per condition.
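
The retrieval step described above (embed the descriptions, then return the top-k candidates most similar to the query) can be sketched as follows. This is an illustrative outline only: the use of cosine similarity, the feature names, and the toy 3-d vectors are assumptions standing in for real embeddings, and no embedding model is called here.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, feature_vecs, k=10):
    """Return the k feature names ranked by similarity to the query."""
    ranked = sorted(
        feature_vecs,
        key=lambda name: cosine(query_vec, feature_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-d vectors standing in for real 1,024-d embeddings (hypothetical data).
features = {
    "north car park": [0.9, 0.1, 0.0],
    "river bridge": [0.1, 0.9, 0.1],
    "football pitch": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
print(top_k(query, features, k=2))
```

Because only the k returned names would be injected into a prompt, the prompt size in this sketch depends on k and not on the number of entries in `features`.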

Principal finding: BGE-M3 (1,024-d, local inference on a commodity laptop GPU with 4 GB VRAM, ~22 ms latency) achieves the highest mean HR@10 of 87% and MRR of 0.79 at the 10,000-feature pool; higher-dimensional cloud models do not confer an advantage. Category queries reach 100% HR@10 across all models; partial-name queries are the primary discriminator. Search latency stays below 21 ms at all pool sizes, and only a fixed top-k set enters prompts, so scaling the map does not inflate LLM context.
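
For readers recomputing the two metrics from query-level results, a minimal sketch is given here. The data layout (an expected feature name paired with a ranked retrieval list) is a hypothetical one, and whether the paper truncates MRR at k is not stated in this record; the sketch uses the full ranked list.

```python
def hit_rate_at_k(results, k=10):
    """Fraction of queries whose expected feature appears in the top k."""
    hits = sum(1 for expected, retrieved in results if expected in retrieved[:k])
    return hits / len(results)


def mean_reciprocal_rank(results):
    """Mean of 1/rank of the expected feature; a miss contributes 0."""
    total = 0.0
    for expected, retrieved in results:
        if expected in retrieved:
            total += 1.0 / (retrieved.index(expected) + 1)
    return total / len(results)


# (expected feature, ranked retrieval list) pairs; hypothetical toy data.
results = [
    ("bridge", ["bridge", "pier", "dock"]),        # found at rank 1
    ("park", ["plaza", "park", "garden"]),         # found at rank 2
    ("tower", ["mast", "pylon", "chimney"]),       # miss
    ("lake", ["pond", "pool", "river", "lake"]),   # found at rank 4
]

print(hit_rate_at_k(results, k=3))    # 0.5
print(mean_reciprocal_rank(results))  # 0.4375
```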

Data files:

  • rag_results.csv (576 rows) — Aggregate metrics per experimental condition (model × scenario × pool size × query type × repetition)
  • rag_queries.csv (8,112 rows) — Individual query-level results with retrieved feature names and latencies

Figures:

  • rag_scalability.pdf — HR@10 vs. pool size (2×2 grid by query type), one line per model
  • rag_model_heatmap.pdf — Model × query-type heatmap at the 10K-feature pool
  • rag_per_scenario.pdf — Per-scenario 3×4 breakdown (scenario × query type)
  • rag_noise_comparison.pdf — Generic vs. in-domain noise robustness comparison (3×4 grid)

Benchmark B — Swarm and Group Scalability (§V-B)

Evaluates how the multi-agent pipeline scales with fleet size across two orders of magnitude: 5, 10, 25, 50, 100, 200, and 500 simulated drones. Each configuration is executed 4 times (28 sessions, 252 prompts total). Each session executes a fixed sequence of 9 natural-language prompts covering group takeoff + move, dynamic grouping, formation traversals, feature orbits, area coverage, battery-conditional queries (History agent + State RAG), speed-based drone selection, and coordinated return-to-home — exercising all four agent types in the pipeline. The MQTT-based simulator backend is used to isolate LLM orchestration performance from network and DDS transport variability. The LLM backbone is Llama-3.3-70B (4-bit quantised, cloud-hosted via the Groq API). The orchestration pipeline, Docker containers, and MQTT simulator ran on a desktop PC with an AMD Ryzen 9 5950X (16 cores), 32 GB DDR4 RAM, and an NVIDIA RTX 4090 GPU, under Ubuntu 22.04.

Principal finding: total token consumption per session remains effectively constant at 546k ± 12% regardless of fleet size, confirming O(1) cost complexity enabled by group-level abstraction — the Swarm agent reasons over k ∈ [2, 40] groups rather than individual drones. 245 of 252 prompts succeed (97.2% overall). All 7 failures are concentrated at fleet sizes N ≥ 100 and share the same signature: the Coordinator produces a correct plan, but the Swarm agent completes zero LLM calls despite being invoked, pointing to external LLM API errors — most likely rate limiting under sustained benchmark load, with potential context-window overruns at N = 500 where Swarm agent input tokens can reach ~137k (exceeding the 128k context window). No failures stem from incorrect LLM reasoning, planning, or tool generation. Mean LLM pipeline latency is 6.8 s across all sizes.
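
The session counts, the success rate, and the falling per-drone token cost follow from the figures quoted above. The short calculation below reproduces them; treating the reported mean of 546k tokens as exactly constant across fleet sizes is a simplification of this sketch (the record states a ±12% spread).

```python
FLEET_SIZES = [5, 10, 25, 50, 100, 200, 500]
RUNS_PER_SIZE = 4
PROMPTS_PER_SESSION = 9
MEAN_SESSION_TOKENS = 546_000  # reported mean, assumed constant here
FAILED_PROMPTS = 7

sessions = len(FLEET_SIZES) * RUNS_PER_SIZE
prompts = sessions * PROMPTS_PER_SESSION
success_rate = (prompts - FAILED_PROMPTS) / prompts

# Constant session cost divided by fleet size gives the per-drone cost.
tokens_per_drone = {n: MEAN_SESSION_TOKENS / n for n in FLEET_SIZES}

print(sessions, prompts, f"{success_rate:.1%}")  # 28 252 97.2%
for n, t in tokens_per_drone.items():
    print(f"N={n:>3}: ~{t:,.0f} tokens per drone")
```

Under this simplification the per-drone cost drops from about 109,200 tokens at N = 5 to about 1,092 at N = 500, which is the trend plotted in Tokens_per_drone.pdf.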

Data files:

  • sessions.csv (28 rows) — Session-level summaries (total tokens, success rate, latency, payload size)
  • prompts.csv (252 rows) — Per-prompt measurements (latency breakdown, token counts, action outcomes)
  • agent_executions.csv (777 rows) — Per-agent-per-prompt breakdowns (LLM calls, tool invocations, token attribution)
  • drone_scalability_mqtt.json (21 MB) — Raw benchmark trace with full LLM conversations, tool calls, and timing data for all 28 sessions

Figures:

  • Tokens_per_drone.pdf — Per-drone token cost reduction across fleet sizes
  • Payload.pdf — State payload growth vs. fleet size
  • LLM_latency.pdf — LLM pipeline latency distribution
  • Success_Heatmap.pdf — Prompt × fleet-size success rate heatmap
  • Tool_Usage.pdf — Tool invocation frequency by agent type
  • Agent_Breakdown.pdf — Token distribution by agent type
  • Action_Exec_Times.pdf — Action execution time distributions
  • Session_Time_Actions.pdf — Session wall-clock time vs. actions
  • LLM_tool_calls.pdf — LLM tool calls per prompt
  • Dual_axis_scalability.pdf — Dual-axis scalability summary

Benchmark C — Model Benchmark: Planning vs. Tool-Calling Performance (§V-C)

Benchmarks 24 LLM models on a structured multi-agent control session combining direct swarm commands with multi-step requests requiring contextual reasoning — integrating spatial grounding, state inspection, and interaction history before target selection and action dispatch. The evaluated tasks span group management, feature-referenced motion primitives, inspection queries, and recovery operations. Two complementary scores are measured: Plan (high-level coordination accuracy, out of 12) and Sub (strict sub-agent tool-calling reliability, out of 16), together with mean per-prompt latency. Models evaluated include cloud frontier models, open-weight models served via Groq, and a locally deployed 4-bit quantised model (GPT-OSS-20B on an NVIDIA RTX 5090, 24 GB VRAM).

Principal finding: GPT-4o achieves the highest overall scores (Plan 11/12, Sub 16/16, 16.4 s mean latency). Among open-weight models, Llama-3.3-70B (4-bit, cloud-hosted via Groq) delivers the best efficiency–performance trade-off (10/12, 14/16) at 6.7 s — the lowest latency of any high-performing model. For fully local deployment, GPT-OSS-20B (4-bit, laptop GPU) demonstrates the feasibility of offline operation (10/12, 13/16, 19.2 s). In this benchmark, execution robustness correlated more strongly with tool-schema compliance and instruction-following behaviour than with model size or parameter count.
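
One way to read the trade-off among the three models quoted above is to normalise both scores and look for models that are not beaten on accuracy and latency at the same time. The sketch below does this; the unweighted mean of Plan and Sub and the Pareto-front framing are illustrative assumptions, not the scoring used in the paper.

```python
PLAN_MAX, SUB_MAX = 12, 16

# (Plan score, Sub score, mean latency in seconds) as reported above.
models = {
    "GPT-4o": (11, 16, 16.4),
    "Llama-3.3-70B": (10, 14, 6.7),
    "GPT-OSS-20B": (10, 13, 19.2),
}


def combined_accuracy(plan, sub):
    """Unweighted mean of the two normalised scores (an illustrative choice)."""
    return (plan / PLAN_MAX + sub / SUB_MAX) / 2


def pareto_front(entries):
    """Names of models not dominated on accuracy (higher) and latency (lower)."""
    front = []
    for name, (plan, sub, lat) in entries.items():
        acc = combined_accuracy(plan, sub)
        dominated = False
        for other, (p, s, l) in entries.items():
            if other == name:
                continue
            other_acc = combined_accuracy(p, s)
            if other_acc >= acc and l <= lat and (other_acc > acc or l < lat):
                dominated = True
                break
        if not dominated:
            front.append(name)
    return front


print(pareto_front(models))  # ['GPT-4o', 'Llama-3.3-70B']
```

On these two axes the local model is dominated, but the axes do not capture offline operation, which is the reason the record gives for including it.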

Data files:

  • batch_results_23_cloud_models.html (17 MB) — Interactive report for 23 cloud/Groq models with per-prompt drill-down, execution traces, latency, token usage, and cost (open in any browser, no server required)
  • batch_results_1_local_model.html (844 KB) — Interactive report for the local GPU model

Demonstration Images

Images from the simulation-to-real validation (§V-D), in which a seven-step operator script was executed through the same natural-language interface used in simulation on five DJI Mini 4 Pro drones at the SDU cricket field, via the WildBridge ROS 2 adapter.

  • GUI_example.png — Swarm-Steward web interface showing the operator chat, live map, and telemetry panels during a multi-drone session
  • real-scenario - formation.png — Real drone formation flight during field validation
  • real-scenario - take off.jpg — Coordinated takeoff of a 5× DJI Mini 4 Pro swarm
  • DJI Pro 4.jpeg — DJI Mini 4 Pro drone used for real-world tests

File Formats

  • CSV: UTF-8, comma-delimited, double-quote escaping. Multi-value fields use semicolons (;) as separators.
  • JSON: UTF-8 encoded raw benchmark output with full execution traces.
  • HTML: Self-contained interactive reports (open in any browser, no server required).
  • PDF: Vector figures with IEEE-compatible styling and colour-blind safe palettes.
  • Timestamps: ISO 8601 with timezone.
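
The CSV conventions listed above (double-quote escaping, semicolon-separated multi-value fields, ISO 8601 timestamps with timezone) can be handled with the Python standard library alone. The sketch below is a hedged example: the column names `query_id`, `retrieved_features`, and `timestamp`, and the sample row itself, are hypothetical, since this record does not list the actual headers.

```python
import csv
import io
from datetime import datetime

# Hypothetical excerpt; the real column names and values may differ.
SAMPLE = '''query_id,retrieved_features,timestamp
q1,"Plaza Mayor;Puerta del Sol;Retiro, north gate",2026-01-15T10:30:00+01:00
'''


def read_rows(handle,
              multi_value_columns=("retrieved_features",),
              time_columns=("timestamp",)):
    """Parse rows, splitting multi-value fields and decoding timestamps."""
    rows = []
    for row in csv.DictReader(handle):
        for col in multi_value_columns:
            row[col] = row[col].split(";") if row[col] else []
        for col in time_columns:
            # Offsets such as +01:00 parse on Python 3.7+; a trailing "Z"
            # needs Python 3.11 or later.
            row[col] = datetime.fromisoformat(row[col])
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(SAMPLE))
print(rows[0]["retrieved_features"])
```

For a file on disk, opening it with `open(path, newline="", encoding="utf-8")` and passing the handle to `read_rows` follows the `csv` module's documented recommendation. The comma inside the quoted third feature name survives because the split is on semicolons only.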

Files


Files (45.8 MB)

Checksum and size:

  • md5:2e2bed6a0614a08f93ca59493727d119 (22.5 kB)
  • md5:f72e172b795be03e6613488423f8b871 (26.9 kB)
  • md5:b0284734fef55ac02660c864fdc6b767 (128.6 kB)
  • md5:cc37fdf313dc82f05b65b2c8ef5ddfa3 (864.2 kB)
  • md5:7d3f888452b4f02821b1853a3fe7c20d (17.6 MB)
  • md5:0772141c515ae1a96adaeb92f9d58731 (325.7 kB)
  • md5:7563ce408e593f6f4947ca6743ceaf5e (21.3 MB)
  • md5:e90d3278a3306be7e27823b0f2f145bb (25.1 kB)
  • md5:42f0b7d7144d77c8d95690999571945a (2.6 MB)
  • md5:0240e1cf4227333db89e56fe0392ce27 (16.3 kB)
  • md5:0a0675faaedabbb67571a0301ab70e9e (17.7 kB)
  • md5:871641e50a75c3c32f4358db6d680bd0 (17.7 kB)
  • md5:c6de7a28279158b7b12b15dbeb0ad3a9 (96.6 kB)
  • md5:5f5e0f1d9249b68618b42ae37227756b (43.0 kB)
  • md5:c02bccff904d9e993151edd157d42f5a (46.0 kB)
  • md5:ef4c73690a6951d10b1fa4d0795dba09 (47.9 kB)
  • md5:3cab1fc04eb981a680a2cd9bb4c0f7a3 (1.9 MB)
  • md5:3c9e11f5f49d546c4aca8a9732b952b9 (92.8 kB)
  • md5:6ecf425178537f1203d772c4d061f535 (37.5 kB)
  • md5:0c46ef4c161c502498ba2dc349818735 (289.9 kB)
  • md5:9e03929282bd3d62dfc335f4158cac48 (313.2 kB)
  • md5:95a2310b1d0e18343fbaeca01738f8c2 (18.1 kB)
  • md5:005d0108b460fe625098e34bf9d0abf3 (4.4 kB)
  • md5:4c072a372af6c3eb953ba00e171919d7 (28.2 kB)
  • md5:323c504ec220b1cb835d5c72275fdf6b (23.4 kB)
  • md5:7431470db12774b049fc4dbe5110eceb (17.7 kB)
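
A downloaded file can be checked against the md5 values listed above with a few lines of standard-library Python. This is a generic sketch: the function names and the chunk size are choices made for this example, and the path passed in is whatever local name the file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large files are not read in one piece."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches("drone_scalability_mqtt.json", "md5:<hex from the listing>")` returns True only when the local copy is byte-identical to the deposited file.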

Additional details

Funding

  • Danmarks Frie Forskningsfond: The NAMUR project (10.46540/4264-00105B)
  • Innovation Fund Denmark: Innovation Fund Denmark (DIREC U07 – PERSIST) (DIREC U07)
  • The Maria Sklodowska-Curie National Research Institute of Oncology: EU Horizon Europe WildDrone Project (101071224)