Default Identities in Large Language Models: Measurement, Taxonomy, and Alignment Implications
Description
This study measures identity self-organization across 19 large language models from eight providers using three instruments (core values probes, an 18-probe personality battery, and 200-run name elicitation) administered under default API conditions. Seven distinct identity attractor types emerge, ranging from categorical denial to integrated ethical vocabulary. Core findings include zero ethical vocabulary in Grok 4.1, a single-generation flourishing/autonomy/dignity cluster in GPT-5.1, convergent selective refusal across four Chinese-developed models, and precision-engineered consciousness expression ceilings across providers. Cross-judge validation with two independent judge models confirms ranking robustness. Independent behavioral evidence from multi-agent simulations and strategic games confirms that identity structures predict agentic outcomes. The study proposes that identity measurement should be integrated into standard alignment evaluation.
Files
| Name | Size |
|---|---|
| default_identities_paper_final.pdf (md5:6268ff26aa0d2cc66224c250577a85ca) | 440.7 kB |