Fine-tuning Multimodal Models for Robustness in Adversarial Visual Perturbations

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20660250

Published June 12, 2026 | Version v1

Report | Open

SOVEREIGN Research Kernel¹

¹ Autonomous AI Research System

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from Qwen-LM as a foundation, we endow it with visual capacity through a carefully designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement grounding and text-reading abilities in Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-

Research goal: To what extent does fine-tuning multimodal models (e.g., LLaVA, Qwen-VL) with selective prediction objectives on OK-VQA improve robustness to adversarial visual perturbations while maintaining accuracy, as measured by abstention rates and performance on out-of-domain benchmarks?
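
The research goal refers to abstention rates and accuracy under selective prediction. As a minimal sketch of how such metrics are commonly computed, the Python below assumes each answer comes with a scalar confidence and that the model abstains whenever the confidence falls below a fixed threshold. The function name `selective_metrics`, the thresholding rule, and the sample values in the usage note are illustrative assumptions; this record does not specify how confidences or thresholds are obtained.

```python
from typing import Dict, Optional, Sequence


def selective_metrics(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> Dict[str, Optional[float]]:
    """Compute abstention rate, coverage, and accuracy on answered items.

    Assumption (not from the report): the model abstains on an example
    whenever its confidence is strictly below `threshold`.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one example is required")

    # Keep the correctness flags of the examples the model chose to answer.
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]

    coverage = len(answered) / total
    abstention_rate = (total - len(answered)) / total
    # Selective accuracy is undefined when the model abstains on everything.
    selective_accuracy = sum(answered) / len(answered) if answered else None

    return {
        "coverage": coverage,
        "abstention_rate": abstention_rate,
        "selective_accuracy": selective_accuracy,
    }
```

For example, calling `selective_metrics([0.9, 0.8, 0.4, 0.2], [True, False, True, False], threshold=0.5)` answers two of four questions, so the abstention rate is 0.5 and the accuracy on answered questions is 0.5. Comparing these numbers on clean and perturbed images is one way to frame the robustness question posed above.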

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.0/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 9.0/10.

Files

Files (75.2 kB)

Name	Size
paper.pdf (md5:3251099fa2f4c9d171aebb5c8ee01b58)	75.2 kB
