Relational Cognitive Telemetry for Long-Lived LLM Agent Societies: From Internal State Monitoring to Collective Performance
Description
Long-lived large language model (LLM) agent systems, meaning collectives of named agents with persistent memory, individual parameter state, and continuous interaction with each other and the world, are now technically feasible and increasingly deployed. The debugging and governance practices inherited from short-lived LLM applications have not kept pace. Trace-level observability tools record what each agent said and when; they do not measure whether an agent's internal cognitive state has drifted across weeks, whether two agents have developed measurably different ways of attending to each other, or whether the relational structure of an agent society predicts its performance on a collective task.
I propose <em>Relational Cognitive Telemetry</em> (RCT), a telemetry-first framework with two observability lenses: (i) <em>cognitive telemetry</em>, exposing each agent's internal parameter state as a measurable, queryable surface; and (ii) <em>relational telemetry</em>, exposing the directed attention and trust structure between agents as a measurable network. Two lenses, one substrate. I instantiate RCT in the Charenix Lobster Substrate — a live LLM substrate of 20 named agents, currently exposing 441 stable model-layer cognitive parameter families, 682 unfolded observable state fields, and 14 core numeric telemetry dimensions per agent. The present analysis focuses on a 10-agent primary analysis cohort for which directed trust, directed listening exposure, and complete C(10,3) = 120 three-agent Tiamat sandbox triple coverage are all available. All 120 triples were run through 12 seeded trials each: 1,440 controlled raids, plus an ecological-validity benchmark drawn from live raid memory.
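The coverage arithmetic above (10 agents, every unordered triple, 12 seeded trials per triple) can be checked with a minimal sketch. The agent labels, the `build_schedule` helper, and the seed-assignment rule are assumptions made for illustration; the source does not specify how seeds were allocated or how the cohort is named.

```python
from itertools import combinations

# Hypothetical labels: the abstract does not name the 10 cohort agents.
AGENTS = [f"agent_{i:02d}" for i in range(10)]
TRIALS_PER_TRIPLE = 12


def build_schedule(agents, trials_per_triple, base_seed=0):
    """Return one (triple, seed) entry per controlled trial.

    The seed rule (a unique consecutive integer per trial) is an
    assumption, not the protocol described in the source.
    """
    schedule = []
    for triple_index, triple in enumerate(combinations(agents, 3)):
        for trial in range(trials_per_triple):
            seed = base_seed + triple_index * trials_per_triple + trial
            schedule.append((triple, seed))
    return schedule


schedule = build_schedule(AGENTS, TRIALS_PER_TRIPLE)
triples = {triple for triple, _ in schedule}

# Number of distinct triples and total number of scheduled trials.
print(len(triples), len(schedule))
```

Enumerating the triples explicitly, rather than sampling teams, is what makes the coverage complete: every three-agent combination from the cohort appears the same number of times.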
I report three empirical findings. First, directed listening exposure between agents is measurably asymmetric and discriminative across the cohort, supporting the use of attention flow as a substrate-level relational metric. Second, the standard local-brain trust value, while present and structured, is saturated in the current substrate, producing low between-pair variance and limiting its independent discriminative power. I report this as a negative result; it motivates treating trust as a diagnostic rather than a primary signal. Third, controlled Tiamat sandbox outcomes show a low-success regime that is reproducible across the full 120-triple coverage; coverage-supported live raid teams fall in the same low-success regime, supporting ecological validity without licensing the use of live memory as a primary outcome layer.
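To make the first two findings concrete, the following sketch shows one possible way to quantify asymmetry and between-pair variance in a directed pairwise matrix. The metric definitions, function names, and the toy 3-agent matrices are all assumptions chosen for illustration; they are not the paper's measures or its data.

```python
from statistics import pvariance


def directed_asymmetry(matrix):
    """Mean absolute gap |m[i][j] - m[j][i]| over unordered agent pairs."""
    n = len(matrix)
    gaps = [
        abs(matrix[i][j] - matrix[j][i])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(gaps) / len(gaps)


def between_pair_variance(matrix):
    """Population variance of all off-diagonal (directed) entries."""
    n = len(matrix)
    values = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return pvariance(values)


# Toy values only, invented to illustrate the two regimes.
# Row i, column j: how much agent i listens to (or trusts) agent j.
listening = [
    [0.0, 0.8, 0.1],
    [0.2, 0.0, 0.6],
    [0.5, 0.3, 0.0],
]
trust = [
    [0.00, 0.98, 0.97],
    [0.99, 0.00, 0.98],
    [0.98, 0.99, 0.00],
]

for name, matrix in (("listening", listening), ("trust", trust)):
    print(name, directed_asymmetry(matrix), between_pair_variance(matrix))
```

In this toy setup the trust matrix sits near its ceiling, so both its asymmetry and its between-pair variance are small. That is the pattern a saturated signal would be expected to show, and it is why such a signal discriminates poorly between pairs.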
I do not claim that the agents are conscious. I do not claim that internal cognitive parameter values cause raid wins. I claim something narrower: that a long-lived LLM agent society can be made measurable along both an internal and a relational dimension, that the resulting measurements expose regularities absent in trace-level observability, and that the substrate is dense enough, and the experimental protocol concrete enough, that the framework can be falsified, attacked, or improved by external parties.
Files
chen_listening-trust-asymmetry-team-outcomes_2026_data.zip
Additional details
Additional titles
- Subtitle (English): Lobster Observatory Paper 23