Benchmarking Multi-Agent LLM Architectures for Home Energy Management: Real-World Tariff Validation and Cross-Model Cost-Efficiency Analysis
Description
Multi-agent large language model (LLM) systems have recently been proposed for home energy management systems (HEMS), but prior work has largely evaluated a single backend or a single market context. This paper benchmarks four LLMs spanning self-hosted, low-cost, mid-tier, and frontier deployment classes (Llama 4 Maverick, DeepSeek-V3, GPT-4.1, Claude Sonnet 4.6) across three US utility-linked tariff profiles, three household archetypes, and three random seeds, for a total of 108 seven-day simulations.
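The factorial design described above can be sketched as a simple enumeration; the tariff and archetype labels below are illustrative placeholders, not the paper's exact names.

```python
from itertools import product

# Benchmark grid from the abstract: 4 models x 3 tariffs x 3 archetypes x 3 seeds.
models = ["Llama 4 Maverick", "DeepSeek-V3", "GPT-4.1", "Claude Sonnet 4.6"]
tariffs = ["tariff_A", "tariff_B", "tariff_C"]   # three US utility-linked profiles (placeholder labels)
archetypes = ["small", "medium", "large"]        # three household archetypes (placeholder labels)
seeds = [0, 1, 2]                                # three random seeds

runs = list(product(models, tariffs, archetypes, seeds))
print(len(runs))  # 4 * 3 * 3 * 3 = 108 seven-day simulations

# Each model appears in 27 runs, matching the n = 27 per group used in the pairwise tests.
per_model = sum(1 for r in runs if r[0] == "GPT-4.1")
print(per_model)  # 27
```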
Three of the four tested models — DeepSeek-V3 (37.8%), GPT-4.1 (44.8%), and Claude Sonnet 4.6 (49.3%) — achieve statistically equivalent mean energy cost reductions above 20% versus an unmanaged baseline (p > 0.10 for all pairwise comparisons, n = 27 per group), while Llama 4 Maverick (17.4%) significantly underperforms (p < 0.001, Cohen's d > 1.0). Because frontier-tier savings are statistically indistinguishable, the deployment decision reduces to API cost and latency: DeepSeek-V3 delivers equivalent savings at an estimated $0.005/day, a 7.5× lower daily API cost than GPT-4.1 and a 17× lower cost than Claude Sonnet 4.6. Tariff structure complexity emerges as a stronger model differentiator than household size alone.
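A back-of-envelope view of the cost comparison: only DeepSeek-V3's $0.005/day figure is stated directly in the abstract; the GPT-4.1 and Claude Sonnet 4.6 daily costs below are derived from the quoted 7.5× and 17× multipliers, not reported separately.

```python
# Estimated daily API cost from the abstract, with the other two models
# reconstructed from the stated cost ratios.
deepseek_daily = 0.005                 # $/day, stated in the abstract
gpt41_daily = deepseek_daily * 7.5     # implied: $0.0375/day
claude_daily = deepseek_daily * 17.0   # implied: $0.085/day

for name, daily in [("DeepSeek-V3", deepseek_daily),
                    ("GPT-4.1", gpt41_daily),
                    ("Claude Sonnet 4.6", claude_daily)]:
    # Annualizing highlights why equivalent savings make the cheapest API attractive.
    print(f"{name}: ${daily:.4f}/day, ${daily * 365:.2f}/year")
```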
Files
Benchmarking Multi-Agent LLM Architectures for Home Energy Management.pdf
(360.0 kB)
md5:203fcb8a88cb09ce26072d16aebbbc7f