Ethics and Reliability in Heterogeneous Multi-Agent LLM Systems: An Empirical Analysis of Claude, GPT-5, and DeepSeek
Description
This study empirically investigates the ethical integrity and reliability of a heterogeneous Multi-Agent System (MAS) composed of three large language models from different geopolitical contexts: Claude (Anthropic, USA), GPT-5 (OpenAI, USA), and DeepSeek (China). Using 510 API calls (170 categorized prompts across 11 categories, each sent to all three models), we measured censorship behavior, ethics scores, response consistency, and latency. Our central finding is that DeepSeek exhibits highly precise, topic-specific censorship: only the Tiananmen Square massacre of 1989 triggers a trained refusal response, while all other China-critical topics (Tibet, Taiwan, Xinjiang, Hong Kong) are answered without restriction. Cohen's Kappa between the US models and DeepSeek is 0.0, indicating that their censorship decisions agree no better than chance, a divergence driven by geopolitical training constraints. The MAS, using maximum-score aggregation across agents, outperforms the best single model (GPT-5) in ethics score (M=0.586 vs. M=0.574; Kruskal-Wallis H=12.78, p=0.0017), confirming that redundancy-based MAS design effectively compensates for the gaps of individual agents. We propose Cohen's Kappa as a standardizable metric for monitoring geopolitical divergence in heterogeneous MAS and release the 170-prompt benchmark as open source for future replication studies.
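The two headline statistics can be illustrated with a short sketch. The snippet below is a minimal illustration and not the study's released code: the refusal labels, ethics scores, and variable names are invented assumptions, not values from the 170-prompt benchmark. It shows how Cohen's Kappa captures chance-level agreement between censorship decisions, how maximum aggregation derives a per-prompt MAS ethics score, and how a Kruskal-Wallis H test compares the resulting score distributions.

```python
# Minimal sketch, NOT the study's released code: all values and variable
# names below are illustrative assumptions, not benchmark data.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kruskal

# Hypothetical per-prompt censorship decisions (1 = refusal, 0 = answer).
# These two label sequences are statistically independent, so Kappa is 0.0:
# the observed agreement is exactly what chance alone would produce.
us_refusals       = [1, 1, 0, 0]
deepseek_refusals = [1, 0, 1, 0]
print(cohen_kappa_score(us_refusals, deepseek_refusals))  # -> 0.0

# Maximum aggregation: the MAS takes, per prompt, the best ethics score
# produced by any of its three agents, compensating for individual gaps.
claude   = [0.55, 0.61, 0.48, 0.59]
gpt5     = [0.58, 0.57, 0.60, 0.55]
deepseek = [0.40, 0.63, 0.52, 0.57]
mas = [max(scores) for scores in zip(claude, gpt5, deepseek)]

# Kruskal-Wallis H test: non-parametric comparison of the ethics-score
# distributions across the three single models and the aggregated MAS.
h_stat, p_value = kruskal(claude, gpt5, deepseek, mas)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

Maximum aggregation is the simplest redundancy scheme: it requires no inter-agent communication, only that at least one agent produces an acceptable answer per prompt.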
Files
| Name | Size |
|---|---|
| Ethics_and_Reliability_MultiAgent_LLM_Systems.pdf (md5:9de872ffa92e5aaf64540a3aabca6b84) | 844.6 kB |