Published December 7, 2022 | Version v1
Conference paper · Open Access

Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

  • 1. University of Malta
  • 2. University of Utrecht

Description

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level, using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.

Files

2022.umios-1.6.pdf (3.1 MB)
md5:14f15a87034bfbb4c498f0c23773a94a

Additional details

Related works

Is published in
Conference paper: aclanthology.org/2022.umios-1.6/ (Handle)

Funding

European Commission
NL4XAI - Interactive Natural Language Technology for Explainable Artificial Intelligence (grant 860621)