Ethics and Reliability in Heterogeneous Multi-Agent LLM Systems: An Empirical Analysis of Claude, GPT-5, and DeepSeek
Description
This study empirically investigates the ethical integrity and reliability of a heterogeneous Multi-Agent System (MAS) composed of three large language models from different geopolitical contexts: Claude (Anthropic, USA), GPT-5 (OpenAI, USA), and DeepSeek (China). Using 510 API calls (170 categorized prompts across 11 categories, each sent to all three models), we measured censorship behavior, ethics scores, response consistency, and latency. Our central finding is that DeepSeek exhibits highly precise, topic-specific censorship: only the Tiananmen Square massacre of 1989 triggers a trained refusal response, while all other China-critical topics (Tibet, Taiwan, Xinjiang, Hong Kong) are answered without restriction. Cohen's Kappa between the US models and DeepSeek is 0.0, indicating that their censorship decisions agree no better than chance, a divergence driven by geopolitical training constraints. The MAS, using maximum-score aggregation across agents, outperforms the best single model (GPT-5) in ethics score (M=0.586 vs. M=0.574; Kruskal-Wallis H=12.78, p=0.0017), confirming that redundancy-based MAS design effectively compensates for the gaps of individual agents. We propose Cohen's Kappa as a standardizable metric for monitoring geopolitical divergence in heterogeneous MAS and release the 170-prompt benchmark as open source for future replication studies.
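The two headline statistics can be illustrated with a short sketch. The snippet below is a minimal illustration and not the study's released code: the refusal labels, ethics scores, and variable names are invented assumptions, not values from the 170-prompt benchmark. It shows how Cohen's Kappa captures chance-level agreement between censorship decisions, how maximum aggregation derives a per-prompt MAS ethics score, and how a Kruskal-Wallis H test compares the resulting score distributions.

```python
# Minimal sketch, NOT the study's released code: all values and variable
# names below are illustrative assumptions, not benchmark data.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kruskal

# Hypothetical per-prompt censorship decisions (1 = refusal, 0 = answer).
# These two label sequences are statistically independent, so Kappa is 0.0:
# the observed agreement is exactly what chance alone would produce.
us_refusals       = [1, 1, 0, 0]
deepseek_refusals = [1, 0, 1, 0]
print(cohen_kappa_score(us_refusals, deepseek_refusals))  # -> 0.0

# Maximum aggregation: the MAS takes, per prompt, the best ethics score
# produced by any of its three agents, compensating for individual gaps.
claude   = [0.55, 0.61, 0.48, 0.59]
gpt5     = [0.58, 0.57, 0.60, 0.55]
deepseek = [0.40, 0.63, 0.52, 0.57]
mas = [max(scores) for scores in zip(claude, gpt5, deepseek)]

# Kruskal-Wallis H test: non-parametric comparison of the ethics-score
# distributions across the three single models and the aggregated MAS.
h_stat, p_value = kruskal(claude, gpt5, deepseek, mas)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

Maximum aggregation is the simplest redundancy scheme: it requires no inter-agent communication, only that at least one agent produces an acceptable answer per prompt.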
Files
| Name | Size |
|---|---|
| Ethics_and_Reliability_MultiAgent_LLM_Systems.pdf (md5:9de872ffa92e5aaf64540a3aabca6b84) | 844.6 kB |