Published May 29, 2025 | Version v0.1.0
Dataset Open

Dataset for the paper "DialogueAV: a Dialogue-attended Audiovisual Dataset"

  • 1. INESC TEC
  • 2. Faculty of Engineering, University of Porto (FEUP)
  • 3. National Institute of Informatics
  • 4. School of Engineering, Polytechnic of Porto (ISEP)
  • 5. Hiroshima University

Description

Introduction

This is the official release of the dataset for the paper "DialogueAV: a Dialogue-attended Audiovisual Dataset". Dialogue-AV is a benchmarking dataset with ~258k video clips. Each clip has two dialogue-based descriptions: a Question-Answering Dialogue (QDA) with ten question-answer pairs, and a simulated conversation between two "humans" discussing the video.

The dialogues are derived from human-created captions in state-of-the-art (SOTA) benchmarking datasets and from machine-generated captions. We use verified annotations from these datasets, focusing solely on describing the audiovisual content.

Description

In the Dialogue-AV sample we present next, the input consists of a video containing an audio track along with its original text captions (1). The output is a series of dialogue turns that describe the video's content. We process the input video using audio and video captioners (2), which generate text descriptions corresponding to each modality. All captions, including the original, are transformed into dialogue (4) and question-answer (5) conversations that articulate the audiovisual content.

https://github.com/lvilaca16/dialogue-av/blob/main/docs/figures/example_dialogue.png

Annotations in (4) and (5) undergo automatic validation (3) before they are accepted into Dialogue-AV. To pass the automatic validation step (3), a sample must meet all of the following criteria:

  1. It includes between 5 and 20 dialogue turns.
  2. Each dialogue turn has at least one complete sentence. A complete sentence requires at least 1 subject, predicate, object or noun, and 1 verb; it should end with appropriate punctuation and begin with a named character. Additionally, each complete sentence must contain a minimum of 3 words after removing punctuation (this excludes very short sentences such as "It rains.").
  3. It avoids the terms "caption(s)" and "dialogue(s)", thereby eliminating references to the original prompt.
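
The criteria above can be approximated with a short filter. The following is a minimal sketch, not the authors' implementation: the function names and the regular expressions are assumptions made for illustration, and the grammatical checks (subject, verb, named character) are left out because they would need a part-of-speech tagger. It covers only the turn count, the end punctuation, the 3-word minimum, and the banned terms.

```python
import re
import string

MIN_TURNS, MAX_TURNS = 5, 20   # rule 1: allowed number of dialogue turns
MIN_WORDS = 3                  # rule 2: minimum words per complete sentence

# Rule 3: reject references to the original prompt (assumed word-level match).
BANNED = re.compile(r"\b(captions?|dialogues?)\b", re.IGNORECASE)
# Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def is_complete_sentence(sentence: str) -> bool:
    """Simplified check: ends with punctuation and has at least MIN_WORDS words."""
    sentence = sentence.strip()
    if not sentence or sentence[-1] not in ".!?":
        return False
    words = sentence.translate(STRIP_PUNCT).split()
    return len(words) >= MIN_WORDS


def is_valid_dialogue(turns: list[str]) -> bool:
    """Return True if a list of dialogue turns passes the simplified rules."""
    if not MIN_TURNS <= len(turns) <= MAX_TURNS:
        return False
    for turn in turns:
        if BANNED.search(turn):
            return False
        sentences = SENTENCE_SPLIT.split(turn.strip())
        if not any(is_complete_sentence(s) for s in sentences):
            return False
    return True
```

For example, `is_valid_dialogue(["It rains."] * 5)` is rejected because "It rains" has only two words once punctuation is removed.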

For more details about the data generation process, we refer the reader to the (to be published) manuscript.

Correspondence and Maintenance

For details about the implementation, generation and usage, please check the official GitHub page.
If you encounter any issues, please contact us. All project-related issues and feature requests should be submitted through our GitHub Issues page.

Files

Files (61.8 GB)

MD5 checksum                            Size
md5:5c2b05f1f682d06935d8f05e633058e5    61.5 GB
md5:0702ed4f43803eb57fd14d4be2043318    16.9 MB
md5:1701920952c900196d337e2d1d71360b    27.9 MB
md5:2794b4635c3a1a76b9e49c9f494f898b    107.1 MB
md5:5638701fd044f9512109386f74b43e1e    81.4 MB
md5:dff7caf23c3ee0c3ac47278ae42b2957    11.9 MB
md5:500fcd40223914ef838273bbaa2cf730    8.7 MB
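
Because the largest file is 61.5 GB, it may be worth checking a download against the MD5 values listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical and are not part of the Dialogue-AV repository. It requires Python 3.9 or later.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = expected.removeprefix("md5:").lower()
    return md5_of_file(path) == expected
```

A call would look like `matches_checksum(local_path, "md5:500fcd40223914ef838273bbaa2cf730")`, where `local_path` is a placeholder for wherever the corresponding file was saved.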

Additional details

Funding

Fundação para a Ciência e Tecnologia
PhD Scholarship 2022.11905.BD

Software

Repository URL
https://github.com/lvilaca16/dialogue-av
Programming language
Python
Development Status
Active