NLP evaluation in the face of deceptively fluent models

Zarrieß, Sina

doi:10.5281/zenodo.20507082

Published June 2, 2026 | Version v1

Presentation Open

Keynote held at the RetroEval 2026 Symposium

Aberdeen, United Kingdom, 1-2 June, 2026

--

Abstract:

Evaluation has long been one of the most contested challenges in NLG and NLP research. Over the decades, the field has developed a range of paradigms serving distinct purposes — from intrinsic, hypothesis-driven approaches to extrinsic, application-driven methods. The rise of LLMs, however, poses challenges that are more fundamental than those raised by earlier task-specific systems. Put simply: the texts produced by today’s models are extraordinarily fluent. This does not mean they are error-free or fit for every real-world purpose — but their surface polish makes it difficult to detect and diagnose underlying deficiencies, even for trained human evaluators.

In this talk, I argue that this situation demands a new evaluation paradigm, one that shifts the focus from text quality to interaction quality. Rather than asking how good a generated text is in isolation, we should ask whether and to what extent a system enables meaningful, reliable, and predictable interactions with users, bringing user intentions and human-model interaction dynamics into the focus of evaluation. I will present recent work from my group that illustrates what such an interaction-oriented paradigm can look like in practice, and discuss what role LLM-as-a-judge methods could play in it.

Based on:

Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

C Lachenmaier, H Bultmann, S Zarrieß - arXiv preprint arXiv:2604.19245, 2026 (Accepted to ACL 2026)

How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

J Sieker, S Zarrieß - arXiv preprint arXiv:2604.15873, 2026 (Accepted to ACL Findings 2026)

Files

retroeval_talk_zenodo.pdf (5.3 MB, md5:e3b3b05d5e52b5dc1cb45f5ca3566734)

Additional details

Describes: Publication: arXiv:2604.19245 (arXiv); Publication: arXiv:2604.15873 (arXiv)

Available: 2026-06-02
