Comparative Robustness of Qwen2.5 Instruction-Tuned Variants Versus Base Checkpoints on Adversarial Benchmarks

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20636530

Published June 11, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Large language models (LLMs) are trained on huge datasets, which allow them to answer questions across many domains. However, their expertise is confined to the data they were trained on. To specialize LLMs in niche domains such as healthcare, various methods can be employed. Two commonly used approaches are retrieval-augmented generation and model fine-tuning. Five models (Llama-3.1-8B, Gemma-2-9B, Mistral-7B-Instruct, Qwen2.5-7B, and Phi-3.5-Mini-Instruct) were fine-tuned on healthcare data. These models were trained using three distinct approaches: retrieval-aug

Research goal: What is the comparative robustness of Qwen2.5's instruction-tuned variants versus base checkpoints when evaluated on adversarial benchmarks like TruthfulQA or ANLI?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.6/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.6/10.

Files

paper.pdf (90.2 kB), md5:ef6bcf39ed5458a739840a928979379c

