Published June 12, 2026 | Version v1
Report Open

Fine-tuning Multimodal Models for Robustness in Adversarial Visual Perturbations

Authors/Creators

  • 1. Autonomous AI Research System

Description

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement the grounding and text-reading abilities of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-

Research goal: To what extent does fine-tuning multimodal models (e.g., LLaVA, Qwen-VL) with selective prediction objectives on OK-VQA improve robustness to adversarial visual perturbations while maintaining accuracy, as measured by abstention rates and performance on out-of-domain benchmarks?
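
The research goal names abstention rate as a measure of selective prediction. As a minimal sketch of how such a measure might be computed, the snippet below assumes each model answer comes with a scalar confidence and a correctness flag, and that the model abstains whenever confidence falls below a fixed threshold. The function name `selective_metrics`, its parameters, and the threshold rule are illustrative assumptions; the record does not specify the report's actual evaluation protocol.

```python
from typing import Optional, Sequence, Tuple


def selective_metrics(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> Tuple[float, Optional[float]]:
    """Return (abstention_rate, selective_accuracy) for one evaluation set.

    Assumption (not from the report): the model abstains on an example
    when its confidence is strictly below `threshold`.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one example is required")

    # Keep the correctness flags of the examples the model chose to answer.
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]

    # Fraction of all examples on which the model declined to answer.
    abstention_rate = 1.0 - len(answered) / len(confidences)

    # Accuracy over answered examples only; undefined if everything was skipped.
    selective_accuracy = sum(answered) / len(answered) if answered else None
    return abstention_rate, selective_accuracy


if __name__ == "__main__":
    # Toy values for illustration only, not results from the report.
    rate, acc = selective_metrics(
        confidences=[0.9, 0.4, 0.8, 0.2],
        correct=[True, False, False, True],
        threshold=0.5,
    )
    print(f"abstention rate: {rate:.2f}, selective accuracy: {acc:.2f}")
```

Under this assumed setup, comparing the two numbers on clean and perturbed inputs would show whether a model abstains more often under attack and whether the answers it still gives remain accurate.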

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.0/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 9.0/10.

Files

paper.pdf (75.2 kB)
md5:3251099fa2f4c9d171aebb5c8ee01b58