Published June 2, 2026 | Version v1
Presentation Open

NLP evaluation in the face of deceptively fluent models

Description

Keynote held at the RetroEval 2026 Symposium

Aberdeen, United Kingdom, 1-2 June, 2026 

--

Abstract:

Evaluation has long been one of the most contested challenges in NLG and NLP research. Over the decades, the field has developed a range of paradigms serving distinct purposes — from intrinsic, hypothesis-driven approaches to extrinsic, application-driven methods. The rise of LLMs, however, poses challenges that are more fundamental than those raised by earlier task-specific systems. Put simply: the texts produced by today’s models are extraordinarily fluent. This does not mean they are error-free or fit for every real-world purpose — but their surface polish makes it difficult to detect and diagnose underlying deficiencies, even for trained human evaluators.

In this talk, I argue that this situation demands a new evaluation paradigm, one that shifts the focus from text quality to interaction quality. Rather than asking how good a generated text is in isolation, we should ask whether and to what extent a system enables meaningful, reliable, and predictable interactions with users, placing user intentions and human-model interaction dynamics at the center of evaluation. I will present recent work from my group that illustrates what such an interaction-oriented paradigm can look like in practice, and discuss what role LLM-as-a-judge methods could play in it.

 

Based on:

C Lachenmaier, H Bultmann, S Zarrieß - arXiv preprint arXiv:2604.19245, 2026 (Accepted to ACL 2026)
 
How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models, J Sieker, S Zarrieß - arXiv preprint arXiv:2604.15873, 2026 (Accepted to ACL Findings 2026)

Files

retroeval_talk_zenodo.pdf (5.3 MB)
md5:e3b3b05d5e52b5dc1cb45f5ca3566734

Additional details

Related works

Describes
Publication: arXiv:2604.19245 (arXiv)
Publication: arXiv:2604.15873 (arXiv)

Dates

Available
2026-06-02