Artifact VisionBreaker: Fuzzing VLM Machines
Description
Abstract. Vision Language Models (VLMs) are deep learning architectures that integrate vision and language by mapping images and text into a shared latent space, typically employing dual encoders and contrastive learning. These models power applications such as image captioning, visual question answering, cross-modal retrieval, medical imaging analysis, and robotics. However, their robustness remains a critical concern, particularly in adversarial settings where input manipulations can degrade performance or expose vulnerabilities.
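The shared-latent-space matching described above can be sketched in a few lines. This is a toy illustration only: the hand-written vectors below stand in for the outputs of a real dual-encoder VLM's image and text encoders, and the caption strings are made up for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity, the standard matching score in contrastive VLMs.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for dual-encoder outputs: one image
# embedding and two candidate caption embeddings, all living in the
# same (here 4-dimensional) shared latent space.
image_emb = np.array([0.9, 0.1, 0.0, 0.2])
caption_embs = {
    "a photo of a dog": np.array([0.8, 0.2, 0.1, 0.1]),
    "a photo of a car": np.array([0.1, 0.9, 0.3, 0.0]),
}

# Cross-modal retrieval: rank captions by similarity to the image.
best = max(caption_embs, key=lambda c: cosine_similarity(image_emb, caption_embs[c]))
print(best)  # the dog caption lies closest to the image in the shared space
```

A real pipeline would replace the toy vectors with encoder outputs (e.g. from a CLIP-style model) and normalize embeddings once up front, but the retrieval logic is the same argmax over similarities.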
In this research, we systematically evaluate the robustness of VLMs against adversarial attacks by extending existing attacks to apply perturbations along multiple directions. We analyze prior robustness studies that exploit corrupted images in order to pinpoint how combinations of attacks push the models beyond their decision boundaries.
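To make the notion of perturbing an input along a loss-increasing direction concrete, here is a minimal FGSM-style sketch. The squared-error loss against a fixed target vector is an assumption chosen so the gradient is analytic; an actual VLM attack would differentiate a cross-modal loss through the image encoder instead.

```python
import numpy as np

# Toy stand-in for a model loss: squared distance between an "image"
# feature vector and a fixed target embedding (assumption; a real
# attack would use the VLM's encoder and a cross-modal loss).
target = np.array([1.0, -1.0, 0.5])

def loss(x):
    return float(np.sum((x - target) ** 2))

def grad(x):
    # Analytic gradient of the squared-error loss above.
    return 2.0 * (x - target)

def fgsm_step(x, eps):
    # Fast Gradient Sign Method: shift every feature by eps in the
    # signed-gradient direction, i.e. the direction that increases the loss.
    return x + eps * np.sign(grad(x))

x = np.array([0.2, 0.3, 0.1])
x_adv = fgsm_step(x, eps=0.05)
print(loss(x), loss(x_adv))  # the perturbed input incurs a higher loss
```

Attacks that combine perturbations in different directions generalize this single signed-gradient step, e.g. by iterating it (PGD-style) or mixing gradient directions from several loss terms.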
Leveraging these insights, we propose novel attack strategies that target specific vulnerabilities in VLMs. By benchmarking these attacks against state-of-the-art models, we aim to deepen the understanding of VLM robustness and to propose mitigation strategies that strengthen VLM security in high-stakes applications.
TODO: Add all instructions. Later.