A Comprehensive Evaluation of Fine-Tuned LLMs as AI-Generated Text Detectors under Adversarial and Multi-Domain Conditions
Authors/Creators
Description
The emergence of Large Language Models (LLMs) that generate fluent, human-like text increases risks such as disinformation and calls for detection systems that remain effective across domains and under adversarial attacks such as paraphrasing. We use fine-tuned LLMs for AI-generated text detection and evaluate three strategies: supervised fine-tuning (SFT), SFT combined with Direct Preference Optimization (SFT+DPO), and multi-stage fine-tuning (SFT+SFT). We assess these strategies in single-domain and multi-domain settings, examining the impact of domain and the robustness to adversarial manipulations. The results indicate that this approach consistently matches or outperforms current methods in the literature. Across four evaluation settings that vary domain overlap and generator overlap, our best detector maintains high F1 under ten adversarial transformations (formatting, spelling perturbations, and semantic rewrites). In the most challenging condition, where texts come from domains unseen during fine-tuning and are generated by unseen models, the top multi-stage configuration achieves an average F1 of 95.25%, outperforming strong zero-shot baselines (Binoculars, RADAR, GLTR) and a supervised RAID leaderboard baseline (e5-small) under the same evaluation protocol. These results highlight the practical robustness and generalization properties of fine-tuned LLM-based detectors; the work does not propose a new detection paradigm.
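As a rough illustration of the kind of metric reported above, the following sketch computes F1 separately for each adversarial transformation and then takes the unweighted mean. The averaging scheme, the function names (`f1_score`, `average_f1_by_attack`), the record layout, and the toy data are all assumptions made for this example; the description does not specify how the average F1 is aggregated, and the numbers below are not results from the paper.

```python
from collections import defaultdict


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (here: 1 = AI-generated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_f1_by_attack(records):
    """records: iterable of (attack_name, true_label, predicted_label).

    Returns (per-attack F1 dict, unweighted mean over attacks).
    Assumption: each transformation contributes equally to the average.
    """
    grouped = defaultdict(lambda: ([], []))
    for attack, label, pred in records:
        grouped[attack][0].append(label)
        grouped[attack][1].append(pred)
    per_attack = {a: f1_score(t, p) for a, (t, p) in grouped.items()}
    mean_f1 = sum(per_attack.values()) / len(per_attack) if per_attack else 0.0
    return per_attack, mean_f1


# Made-up toy predictions for two hypothetical transformations.
records = [
    ("none", 1, 1), ("none", 0, 0), ("none", 1, 1), ("none", 0, 0),
    ("paraphrase", 1, 1), ("paraphrase", 1, 0),
    ("paraphrase", 0, 0), ("paraphrase", 0, 1),
]
per_attack, mean_f1 = average_f1_by_attack(records)
print(per_attack, mean_f1)
```

Reporting the per-transformation scores alongside the mean makes it visible when a single attack, such as paraphrasing, accounts for most of the degradation.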
Files
A_Comprehensive_Evaluation_of_Fine-Tuned_LLMs_as_AI-Generated_Text_Detectors_under_Adversarial_and_Multi-Domain_Conditions.pdf
Size: 5.7 MB
md5:8a1dc1fd207337499d1826a5d9b72ca8