A Behavioural Evaluation Framework for AI Judgement Systems
Description
Beyond the Average Research Series – Working Paper
This working paper introduces a conceptual framework for evaluating the behavioural reliability of AI judgement systems. The framework emerged from the Agents at Work research series (Hull, 2025–2026), which examined how large language models interpret age-coded language in recruitment text and how stable those judgements remain when evaluation tasks are repeated.
Abstract
Large language models are increasingly used to perform evaluative or judgement-based tasks, including classification, moderation, and analytical assessment. In such contexts, reliability cannot be assessed from a single output, because a language model may interpret the same task differently each time it is executed.
This paper proposes a behavioural evaluation framework for examining how AI judgement systems behave under repeated evaluation. The framework focuses on three complementary analytical perspectives: repeated execution of the same evaluative task, observation of internal signals such as confidence or agreement indicators, and independent system comparison across multiple AI models. Together, these perspectives allow researchers to observe patterns of behavioural stability, convergence, drift or fragmentation in AI judgement processes.
Rather than focusing solely on output accuracy, the framework emphasises behavioural observation as a means of understanding how AI systems interpret complex text and how consistently those interpretations are maintained. The proposed structure is intended to support more systematic analysis of reliability in AI judgement systems and to provide a conceptual foundation for future empirical evaluation studies.
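As an illustration only, the three perspectives named in the abstract can be sketched as simple summary statistics. The sketch below is not taken from the working paper: the function names, the labels ("age-coded", "neutral"), the model names, and all values are hypothetical, and modal agreement is just one possible stability measure among many.

```python
from collections import Counter
from itertools import combinations
from statistics import fmean


def modal_label(labels):
    """Return the most frequent label among repeated runs of one task."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels).most_common(1)[0][0]


def modal_agreement(labels):
    """Share of repeated runs that return the modal label (1.0 = fully stable)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels).most_common(1)[0][1] / len(labels)


def mean_confidence(confidences):
    """Average of self-reported confidence scores (assumed to lie in [0, 1])."""
    return fmean(confidences)


def cross_model_agreement(runs_by_model):
    """Fraction of model pairs whose modal labels match."""
    modal = {name: modal_label(labels) for name, labels in runs_by_model.items()}
    pairs = list(combinations(modal, 2))
    if not pairs:
        return 1.0  # a single model trivially agrees with itself
    return sum(modal[a] == modal[b] for a, b in pairs) / len(pairs)


# Hypothetical data: three models each judge the same text five times.
runs_by_model = {
    "model_a": ["age-coded"] * 5,
    "model_b": ["age-coded", "age-coded", "neutral", "age-coded", "neutral"],
    "model_c": ["neutral", "neutral", "neutral", "neutral", "age-coded"],
}
# Hypothetical self-reported confidence scores for model_b's five runs.
confidences_b = [0.9, 0.8, 0.6, 0.9, 0.5]

for name, labels in runs_by_model.items():
    print(name, modal_label(labels), modal_agreement(labels))
print("model_b mean confidence:", mean_confidence(confidences_b))
print("cross-model agreement:", cross_model_agreement(runs_by_model))
```

In this made-up example, model_a is fully stable across runs, model_b fluctuates between two interpretations, and model_c is mostly stable but settles on a different label from the other two, so within-model stability and cross-model agreement capture different things.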
Note
This paper is released as a working paper to present the conceptual framework. Future work will extend the framework through larger-scale empirical experiments and behavioural stress testing of AI judgement systems.
Files
A Behavioural Evaluation Framework for AI Judgement Systems.pdf (409.3 kB)
md5:c6ee0fa9c5510e136585bd5007eac73d