Published March 16, 2026 | Version 1.0
Working paper | Open Access

A Behavioural Evaluation Framework for AI Judgement Systems

  • Independent Researcher

Description

Beyond the Average Research Series – Working Paper

This working paper introduces a conceptual framework for evaluating the behavioural reliability of AI judgement systems. The framework emerged from the Agents at Work research series (Hull, 2025–2026), which examined how large language models interpret age-coded language in recruitment text and how stable those judgements remain when evaluation tasks are repeated.

Abstract

Large language models are increasingly used to perform evaluative or judgement-based tasks, including classification, moderation, and analytical assessment. In such contexts, reliability cannot be assessed solely through single outputs, as language models may produce varying interpretations across repeated executions of the same task.

This paper proposes a behavioural evaluation framework for examining how AI judgement systems behave under repeated evaluation. The framework focuses on three complementary analytical perspectives: repeated execution of the same evaluative task, observation of internal signals such as confidence or agreement indicators, and independent system comparison across multiple AI models. Together, these perspectives allow researchers to observe patterns of behavioural stability, convergence, drift or fragmentation in AI judgement processes.
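The three perspectives above can be illustrated with a minimal sketch. The code below is not from the paper; it assumes a hypothetical `mock_judge` stand-in for a repeated LLM evaluation call, and uses two simple illustrative metrics: the fraction of repeated runs agreeing with the modal label (behavioural stability), and the rate at which independent systems' modal labels coincide (cross-system convergence).

```python
import random
from collections import Counter

def stability(labels):
    """Fraction of repeated runs agreeing with the modal label (1.0 = perfectly stable)."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def cross_system_agreement(runs_by_system):
    """Mean pairwise rate at which two systems' modal labels coincide."""
    modal = {s: Counter(r).most_common(1)[0][0] for s, r in runs_by_system.items()}
    systems = list(modal)
    pairs = [(a, b) for i, a in enumerate(systems) for b in systems[i + 1:]]
    if not pairs:
        return 1.0
    return sum(modal[a] == modal[b] for a, b in pairs) / len(pairs)

# Hypothetical judge: stands in for one LLM call classifying recruitment text.
# flip_rate simulates the model occasionally changing its interpretation.
def mock_judge(text, flip_rate, rng):
    label = "age-coded" if "energetic" in text else "neutral"
    if rng.random() < flip_rate:
        label = "neutral" if label == "age-coded" else "age-coded"
    return label

rng = random.Random(0)
text = "Seeking an energetic recent graduate"
# Repeated execution of the same task across three simulated systems.
runs = {f"model_{i}": [mock_judge(text, flip, rng) for _ in range(20)]
        for i, flip in enumerate([0.0, 0.1, 0.45])}

for name, labels in runs.items():
    print(name, "stability:", round(stability(labels), 2))
print("cross-system agreement:", cross_system_agreement(runs))
```

In a real study the `mock_judge` stub would be replaced by actual model calls, and richer metrics (e.g. inter-rater agreement statistics) could replace the simple modal-label comparison; the point here is only the structure: repeat, observe per-system stability, then compare systems.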

Rather than focusing solely on output accuracy, the framework emphasises behavioural observation as a means of understanding how AI systems interpret complex text and how consistently those interpretations are maintained. The proposed structure is intended to support more systematic analysis of reliability in AI judgement systems and to provide a conceptual foundation for future empirical evaluation studies.

Note

This paper is released as a working paper to present the conceptual framework. Future work will extend the framework through larger-scale empirical experiments and behavioural stress testing of AI judgement systems.

Files (409.3 kB)

A Behavioural Evaluation Framework for AI Judgement Systems.pdf