Published April 30, 2026 | Version v1
Journal article Open

A Comprehensive Evaluation of Fine-Tuned LLMs as AI-Generated Text Detectors under Adversarial and Multi-Domain Conditions

  • 1. University of Cagliari
  • 2. University of Thessaly

Description

Large Language Models (LLMs) that generate fluent, human-like text increase risks such as disinformation, which calls for detection systems that remain robust across domains and against adversarial attacks such as paraphrasing. We fine-tune LLMs for AI-generated text detection and evaluate three strategies: supervised fine-tuning (SFT), SFT combined with Direct Preference Optimization (SFT+DPO), and multi-stage fine-tuning (SFT+SFT). We assess these strategies in single-domain and multi-domain settings, examining the effect of domain and robustness to adversarial manipulation. Results indicate that this approach consistently matches or outperforms existing methods in the literature. Across four evaluation settings that vary domain overlap and generator overlap, our best detector maintains high F1 under ten adversarial transformations (spanning formatting changes, spelling perturbations, and semantic rewrites). In the most challenging condition, where texts come from domains unseen during fine-tuning and are generated by unseen models, the top multi-stage configuration achieves an average F1 of 95.25%, outperforming strong zero-shot baselines (Binoculars, RADAR, GLTR) and a supervised RAID leaderboard baseline (e5-small) under the same evaluation protocol. These results highlight the practical robustness and generalization of fine-tuned LLM-based detectors; the work does not propose a new detection paradigm.
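
To make the reported metric concrete, the following is a minimal sketch of how an average F1 across adversarial transformations could be computed. It is not the authors' evaluation code. The function names, the record layout, the attack labels, and the choice to average per-attack F1 scores (rather than pooling all predictions) are assumptions made for illustration, and the sample records are made up.

```python
from collections import defaultdict


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class (assumed here to mean AI-generated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Convention chosen for this sketch: F1 is 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def average_f1_by_attack(records):
    """Group (attack, true_label, predicted_label) records by attack,
    compute F1 per attack, and return the per-attack scores and their mean."""
    grouped = defaultdict(lambda: ([], []))
    for attack, true_label, predicted_label in records:
        grouped[attack][0].append(true_label)
        grouped[attack][1].append(predicted_label)
    per_attack = {name: f1_score(ts, ps) for name, (ts, ps) in grouped.items()}
    mean = sum(per_attack.values()) / len(per_attack) if per_attack else 0.0
    return per_attack, mean


# Hypothetical toy records: (attack name, true label, predicted label).
sample_records = [
    ("paraphrase", 1, 1),
    ("paraphrase", 1, 0),
    ("paraphrase", 0, 0),
    ("whitespace", 1, 1),
    ("whitespace", 0, 0),
]
per_attack_f1, mean_f1 = average_f1_by_attack(sample_records)
```

Averaging per transformation gives each attack equal weight regardless of how many texts it covers; pooling all predictions before computing F1 would weight attacks by sample count instead. The abstract does not state which convention the paper uses.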

Files

A_Comprehensive_Evaluation_of_Fine-Tuned_LLMs_as_AI-Generated_Text_Detectors_under_Adversarial_and_Multi-Domain_Conditions.pdf