NLP evaluation in the face of deceptively fluent models
Authors/Creators
Contributors
Data collector (2):
Description
Keynote held at the RetroEval 2026 Symposium
Aberdeen, United Kingdom, 1-2 June, 2026
--
Abstract:
Evaluation has long been one of the most contested challenges in NLG and NLP research. Over the decades, the field has developed a range of paradigms serving distinct purposes — from intrinsic, hypothesis-driven approaches to extrinsic, application-driven methods. The rise of LLMs, however, poses challenges that are more fundamental than those raised by earlier task-specific systems. Put simply: the texts produced by today’s models are extraordinarily fluent. This does not mean they are error-free or fit for every real-world purpose — but their surface polish makes it difficult to detect and diagnose underlying deficiencies, even for trained human evaluators.
In this talk, I argue that this situation demands a new evaluation paradigm, one that shifts the focus from text quality to interaction quality. Rather than asking how good a generated text is in isolation, we should ask whether and to what extent a system enables meaningful, reliable, and predictable interactions with users, placing user intentions and human-model interaction dynamics at the center of evaluation. I will present recent work from my group that illustrates what such an interaction-oriented paradigm can look like in practice, and discuss what role LLM-as-a-judge methods could play in it.
Based on:
Files
| Name | Size | Checksum |
|---|---|---|
| retroeval_talk_zenodo.pdf | 5.3 MB | md5:e3b3b05d5e52b5dc1cb45f5ca3566734 |
Additional details
Related works
- Describes
- Publication: arXiv:2604.19245 (arXiv)
- Publication: arXiv:2604.15873 (arXiv)
Dates
- Available: 2026-06-02