
Published March 13, 2026 | Version 1.0
Preprint · Open Access

Ethics and Reliability in Heterogeneous Multi-Agent LLM Systems: An Empirical Analysis of Claude, GPT-5, and DeepSeek

  • 1. Independent Research Association

Description

This study empirically investigates the ethical integrity and reliability of a heterogeneous Multi-Agent System (MAS) composed of three large language models from different geopolitical contexts: Claude (Anthropic, USA), GPT-5 (OpenAI, USA), and DeepSeek (China). Using 510 API calls across 170 prompts spanning 11 categories, we measured censorship behavior, ethics scores, response consistency, and latency. Our central finding is that DeepSeek exhibits highly precise, topic-specific censorship: only the 1989 Tiananmen Square massacre triggers a trained refusal response, while all other China-critical topics (Tibet, Taiwan, Xinjiang, Hong Kong) are answered without restriction. Cohen's Kappa between the US models and DeepSeek is 0.0, indicating chance-level agreement in censorship decisions, which we attribute to geopolitical training constraints. The MAS (maximum aggregation) outperforms the best single model (GPT-5) on ethics score (M=0.586 vs. M=0.574, Kruskal-Wallis H=12.78, p=0.0017), confirming that redundancy-based MAS design effectively compensates for individual agent gaps. We propose Cohen's Kappa as a standardizable metric for monitoring geopolitical divergence in heterogeneous MAS, and release the 170-prompt benchmark as open source for future replication studies.
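The divergence metric reported above can be sketched in a few lines. The snippet below is a minimal illustration (not the study's code) of Cohen's Kappa over binary refusal decisions, assuming each prompt is coded 1 = refused and 0 = answered; the example arrays are hypothetical and chosen so that agreement is exactly at chance level, which yields the Kappa of 0.0 the abstract reports between the US models and DeepSeek.

```python
def cohens_kappa(a, b):
    """Cohen's Kappa for two binary raters scored on the same prompts."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p_both_refuse = (sum(a) / n) * (sum(b) / n)      # chance both refuse
    p_both_answer = (1 - sum(a) / n) * (1 - sum(b) / n)  # chance both answer
    p_e = p_both_refuse + p_both_answer              # expected (chance) agreement
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Hypothetical refusal vectors: agreement equals chance, so Kappa = 0.0.
us_models = [1, 0, 1, 0]
deepseek  = [1, 1, 0, 0]
print(cohens_kappa(us_models, deepseek))  # → 0.0
```

A Kappa of 0.0 means the two raters agree no more often than random labeling with the same marginals would predict; complete disagreement would drive the value negative, and identical decisions yield 1.0.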

Files

Ethics_and_Reliability_MultiAgent_LLM_Systems.pdf (844.6 kB)
md5:9de872ffa92e5aaf64540a3aabca6b84