Published March 17, 2026 | Version v1
Journal article | Open

Benchmarking Multi-Agent LLM Architectures for Home Energy Management: Real-World Tariff Validation and Cross-Model Cost-Efficiency Analysis

Authors/Creators

  • Westcliff University

Description

Multi-agent large language model (LLM) systems have recently been proposed for home energy management systems (HEMS), but prior work has largely evaluated a single backend or a single market context. This paper benchmarks four LLMs spanning self-hosted, low-cost, mid-tier, and frontier deployment classes (Llama 4 Maverick, DeepSeek-V3, GPT-4.1, Claude Sonnet 4.6) across three US utility-linked tariff profiles, three household archetypes, and three random seeds, for a total of 108 seven-day simulations.

Three of the four tested models — DeepSeek-V3 (37.8%), GPT-4.1 (44.8%), and Claude Sonnet 4.6 (49.3%) — achieve statistically equivalent mean energy cost reductions above 20% versus an unmanaged baseline (p > 0.10 for all pairwise comparisons, n = 27 per group), while Llama 4 Maverick (17.4%) is a significant underperformer (p < 0.001, Cohen's d > 1.0). Because frontier-tier savings are statistically indistinguishable, the deployment decision reduces to API cost and latency: DeepSeek-V3 delivers equivalent savings at an estimated $0.005/day — a 7.5× lower daily API cost than GPT-4.1 and 17× lower than Claude Sonnet 4.6. Tariff structure complexity emerges as a stronger model differentiator than household size alone.
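The experimental design and cost comparison above can be sanity-checked with simple arithmetic. The sketch below reconstructs the factorial run count and the implied per-model daily API costs from the figures stated in the abstract; note that the GPT-4.1 and Claude Sonnet 4.6 costs are inferred from the stated 7.5× and 17× ratios, not quoted directly in the text.

```python
# Factorial design: 4 models x 3 tariff profiles x 3 household
# archetypes x 3 random seeds, as described in the abstract.
models = ["Llama 4 Maverick", "DeepSeek-V3", "GPT-4.1", "Claude Sonnet 4.6"]
tariff_profiles = 3
household_archetypes = 3
seeds = 3

total_runs = len(models) * tariff_profiles * household_archetypes * seeds
runs_per_model = tariff_profiles * household_archetypes * seeds

print(total_runs)      # 108 seven-day simulations in total
print(runs_per_model)  # n = 27 observations per model group

# Implied daily API costs (USD/day), derived from the reported
# DeepSeek-V3 figure and the stated cost ratios -- an inference,
# not per-token pricing from the paper.
deepseek_daily = 0.005
gpt41_daily = deepseek_daily * 7.5   # ~0.0375 USD/day
claude_daily = deepseek_daily * 17   # ~0.085 USD/day
print(round(gpt41_daily, 4), round(claude_daily, 3))
```

Over a 7-day simulation window, those rates put DeepSeek-V3 at roughly $0.035 versus roughly $0.26 for GPT-4.1 and $0.60 for Claude Sonnet 4.6, which is the basis of the cost-efficiency argument above.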

Files (360.0 kB)

Benchmarking Multi-Agent LLM Architectures for Home Energy Management.pdf