Behavioral Disclosure in LLM-Mediated Bilateral Trade: A Theoretical Framework with Empirical Calibration in Hotel Dynamic Pricing
Authors/Creators
- 1. International Center for Computational Engineering
- 2. Agel AI
Description
This paper develops a theoretical framework for bilateral bargaining mediated by large language models (LLMs) and supplements it with the largest published cross-model empirical study of LLM-mediated bilateral trade to date. The Myerson–Satterthwaite (1983) impossibility theorem rules out bilateral-trade mechanisms between strategic agents that are simultaneously efficient, incentive-compatible, individually rational, and budget-balanced. We introduce a disclosure-rate parameter α and derive closed-form efficiency curves under three hypothesized behavioral modes (binary, continuous, noisy), interpolating between the Chatterjee–Samuelson Bayes–Nash second-best efficiency (≈0.844) and the first-best; a numerical sketch of the binary-mode curve follows the findings list below. The framework is then tested empirically across five experimental phases on ten frontier LLMs (including Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.5, Gemini 3 Flash, DeepSeek V4 Pro, Grok 4.3, Kimi, Qwen, and Gemma) accessed through OpenRouter, totaling approximately 4,320 dialogues and roughly $70 in API spend. Key empirical findings (combined n = 60 per cell):
- Phase 1 (one-shot disclosure): Nine of ten models systematically refuse to disclose reservation values in 60–98% of trials, falsifying the binary/continuous predictions of the framework.
- Phase 2 (multi-turn K=5, abstract domain): Cross-model heterogeneity under an identical protocol is overwhelming: Gemini-Flash 0.924, Claude-Sonnet 0.907, GPT-5.5 0.667, DeepSeek 0.293, Grok 0.168, and Claude-Opus exactly 0/60 (Wilson 95% CI [0%, 6%]; see the interval and test sketches after this list). A Pearson chi-square test against trade-rate homogeneity yields χ² = 61.19, p = 6.9 × 10⁻¹².
- Phase 4 (asymmetric framing): Role asymmetry partially unblocks structural refusal. Claude-Sonnet reaches 0.994 (95% CI [0.977, 1.000]), cleanly excluding the CS bound; Grok triples to 0.619; Claude-Opus partially recovers to 0.367.
- Phase 5 (real hotel B2B in EUR, HJB-derived costs): Claude-Sonnet reaches 0.998 (95% CI [0.996, 1.000]), cleanly excluding the naive posted-price baseline of 0.931, the strongest "LLM beats posted-price" result. GPT-5.5 collapses from 0.667 in the abstract domain to 0.165 in domain (Fisher p = 6.1 × 10⁻⁷). The cross-model omnibus test yields χ² = 76.69, p = 1.6 × 10⁻¹⁶.
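The paper derives the closed-form efficiency curves; as a minimal numerical sketch, the snippet below assumes the binary disclosure mode mixes first-best trade (probability α) with the Chatterjee–Samuelson linear equilibrium (probability 1 - α) under uniform [0, 1] priors, which recovers the quoted second-best value 9/64 ÷ 1/6 = 27/32 ≈ 0.844 at α = 0. The mixture form is an illustrative assumption, not the paper's exact derivation.

```python
# Sketch of the binary-mode efficiency curve E(alpha) under uniform [0,1] priors.
# Assumption (not from the paper): binary mode = first-best with prob. alpha,
# Chatterjee-Samuelson linear equilibrium with prob. 1 - alpha.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
v = rng.uniform(size=n)   # buyer valuations
c = rng.uniform(size=n)   # seller costs

first_best = np.maximum(v - c, 0).mean()             # E[(v-c)+] = 1/6
# CS linear equilibrium: trade occurs iff v >= c + 1/4
cs_gains = np.where(v >= c + 0.25, v - c, 0).mean()  # = 9/64
cs_efficiency = cs_gains / first_best                # = 27/32 ~ 0.84375
print(f"CS second-best efficiency: {cs_efficiency:.4f}")

def binary_mode_efficiency(alpha: float) -> float:
    """Linear interpolation between the CS second-best and the first-best."""
    return alpha * 1.0 + (1 - alpha) * cs_efficiency

for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"alpha={a:.2f} -> efficiency {binary_mode_efficiency(a):.4f}")
```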
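The Wilson interval quoted for Claude-Opus's 0/60 Phase 2 outcome can be checked directly with the standard score-interval formula; a self-contained sketch:

```python
# Wilson score interval for k successes in n trials (95% by default).
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Claude-Opus in Phase 2: 0 trades out of 60
lo, hi = wilson_ci(0, 60)
print(f"Wilson 95% CI: [{lo:.1%}, {hi:.1%}]")  # -> [0.0%, 6.0%]
```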
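The omnibus and pairwise comparisons use standard contingency-table tests. The sketch below shows the scipy calls on counts rounded from the reported rates out of n = 60; since the paper's exact per-cell trade counts are not given here, these counts are illustrative stand-ins and the resulting statistics will not match the reported χ² = 61.19 or Fisher p = 6.1 × 10⁻⁷ exactly.

```python
# Homogeneity and pairwise tests on illustrative counts (rates x 60, rounded).
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Stand-in Phase 2 trade counts out of 60 per model, rounded from the reported
# figures (Gemini, Sonnet, GPT, DeepSeek, Grok, Opus); the paper's counts differ.
trades = np.array([55, 54, 40, 18, 10, 0])
table = np.stack([trades, 60 - trades], axis=1)   # successes vs. failures
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.2e}")

# Pairwise Fisher exact test, e.g. GPT-5.5 abstract vs. in-domain, using a
# 2x2 rounded from the reported 0.667 vs. 0.165 rates at n=60 each.
_, p_fisher = fisher_exact([[40, 20], [10, 50]])
print(f"Fisher exact p={p_fisher:.2e}")
```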
Central thesis: Model identity dominates protocol design in LLM bargaining. The same protocol run on sibling models produces 0.91 versus 0.00 efficiency. Mechanism design for LLM-mediated bilateral trade must therefore be model-aware, and pre-deployment screening must include domain-specific testing; abstract-benchmark performance does not transfer.
Files
paper_Behavioral_Disclosure_in_LLM_Mediated_Bilateral_Trade.pdf
Additional details
Software
- Programming language: Python