Fine-tuning Multimodal Models for Robustness in Adversarial Visual Perturbations
Description
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question-answering, we implement the grounding and text-reading abilities of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-
Research goal: To what extent does fine-tuning multimodal models (e.g., LLaVA, Qwen-VL) with selective prediction objectives on OK-VQA improve robustness to adversarial visual perturbations while maintaining accuracy, as measured by abstention rates and performance on out-of-domain benchmarks?
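The abstention-based measurement named in the research goal can be illustrated with a minimal sketch. It assumes each prediction carries a scalar confidence and a binary correctness flag, and that the model abstains whenever confidence falls below a fixed threshold. None of these details are specified in this record: the function name `selective_metrics`, the thresholding rule, and the binary correctness flag (a simplification of soft VQA-style scoring) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SelectiveMetrics:
    abstention_rate: float     # fraction of questions the model declines to answer
    coverage: float            # fraction of questions answered (1 - abstention_rate)
    selective_accuracy: float  # accuracy measured on answered questions only


def selective_metrics(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> SelectiveMetrics:
    """Hypothetical helper: answer only when confidence >= threshold."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Assumed abstention rule: the model answers iff confidence >= threshold.
    answered = [c >= threshold for c in confidences]
    n_total = len(confidences)
    n_answered = sum(answered)
    n_right = sum(1 for a, ok in zip(answered, correct) if a and ok)

    coverage = n_answered / n_total
    # If the model abstains on everything, selective accuracy is reported as 0.0.
    selective_accuracy = n_right / n_answered if n_answered else 0.0
    return SelectiveMetrics(1.0 - coverage, coverage, selective_accuracy)


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    m = selective_metrics([0.9, 0.4, 0.8, 0.2], [True, False, False, True], 0.5)
    print(m)
```

Under these assumptions, comparing the abstention rate on clean versus perturbed images, alongside selective accuracy, would be one way to check whether a model abstains more often when its inputs are attacked rather than answering incorrectly.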
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.0/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:3251099fa2f4c9d171aebb5c8ee01b58) | 75.2 kB |