Dataset for the paper "DialogueAV: a Dialogue-attended Audiovisual Dataset"

Vilaça, Luís; Viana, Paula; Yu, Yi

doi:10.5281/zenodo.13897941

Published May 29, 2025 | Version v0.1.0

Dataset Open

1. INESC TEC
2. Faculty of Engineering, University of Porto (FEUP)
3. National Institute of Informatics
4. School of Engineering, Polytechnic of Porto (ISEP)
5. Hiroshima University

Introduction

This is the official release of the data for "DialogueAV: a Dialogue-attended Audiovisual Dataset". Dialogue-AV is a benchmarking dataset with approximately 258k video clips. Each clip has two dialogue-based descriptions: a Question-Answering Dialogue (QDA) with ten question-answer pairs, and a simulated conversation between two "humans" discussing the video.

The dialogues are derived from human-created captions in state-of-the-art (SOTA) benchmarking datasets and from machine-generated captions. We use verified annotations from these datasets, focusing solely on descriptions of the audiovisual content.

Description

In the Dialogue-AV sample we present next, the input consists of a video containing an audio track along with its original text captions (1). The output is a series of dialogue turns that describe the video's content. We process the input video using audio and video captioners (2), which generate text descriptions corresponding to each modality. All captions, including the original, are transformed into dialogue (4) and question-answer (5) conversations that articulate the audiovisual content.

https://github.com/lvilaca16/dialogue-av/blob/main/docs/figures/example_dialogue.png

Annotations in (4) and (5) undergo automatic validation (3) before they are accepted into Dialogue-AV. In the automatic validation step (3), accepted samples must:

Include between 5 and 20 dialogue turns.
Have at least one complete sentence in each dialogue turn. A complete sentence requires at least 1 subject, predicate, object or noun, and 1 verb; it should end with appropriate punctuation and begin with a named character. Additionally, each complete sentence must contain a minimum of 3 words after removing punctuation (this excludes overly simple sentences such as "It rains.").
Avoid the terms "caption(s)" and "dialogue(s)", thereby eliminating references to the original prompt.
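
The validation criteria above can be approximated with a short sketch. This is not the authors' implementation: the function names (`is_complete_sentence`, `is_valid_sample`) are hypothetical, and a sample is assumed to be a plain list of turn strings. The sketch checks only the turn count, the minimum word count, the terminal punctuation, and the banned terms; the subject/verb requirement and the "named character" requirement would need a linguistic parser and are deliberately left out.

```python
import re
import string

# Terms that would leak the original prompt (singular or plural, any case).
BANNED_TERMS = re.compile(r"\b(caption|dialogue)s?\b", re.IGNORECASE)
SENTENCE_ENDINGS = (".", "!", "?")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def split_sentences(text):
    """Naive sentence split on terminal punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def is_complete_sentence(sentence, min_words=3):
    """Approximate check: terminal punctuation and at least `min_words` words.

    The grammatical checks (subject, verb) are NOT implemented here.
    """
    if not sentence.endswith(SENTENCE_ENDINGS):
        return False
    words = sentence.translate(PUNCTUATION_TABLE).split()
    return len(words) >= min_words


def is_valid_sample(turns, min_turns=5, max_turns=20):
    """Return True if a list of dialogue turns passes the simplified checks."""
    if not (min_turns <= len(turns) <= max_turns):
        return False
    for turn in turns:
        if BANNED_TERMS.search(turn):
            return False
        if not any(is_complete_sentence(s) for s in split_sentences(turn)):
            return False
    return True
```

With these simplified rules, a turn such as "It rains." is rejected because only two words remain after punctuation is removed, which matches the example given in the criteria.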

For more details about the data generation process, we refer the reader to the (to be published) manuscript.

Correspondence and Maintenance

For details about the implementation, generation and usage, please check the official GitHub page.
If you observe any issues, please contact us. All project-related issues and feature requests should be submitted through our GitHub Issues page.

Files

Files (61.8 GB)

Name	MD5	Size
test.hdf5	5c2b05f1f682d06935d8f05e633058e5	61.5 GB
test.parquet	0702ed4f43803eb57fd14d4be2043318	16.9 MB
test.tar.gz	1701920952c900196d337e2d1d71360b	27.9 MB
train.parquet	2794b4635c3a1a76b9e49c9f494f898b	107.1 MB
train.tar.gz	5638701fd044f9512109386f74b43e1e	81.4 MB
validation.parquet	dff7caf23c3ee0c3ac47278ae42b2957	11.9 MB
validation.tar.gz	500fcd40223914ef838273bbaa2cf730	8.7 MB
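
Since each file is listed with an MD5 checksum, a download can be checked for corruption before use. The following is a minimal sketch using only the Python standard library; the helper names (`md5_of_file`, `verify_downloads`) and the assumption that all files sit in a single local directory are illustrative and not part of the dataset's tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "test.hdf5": "5c2b05f1f682d06935d8f05e633058e5",
    "test.parquet": "0702ed4f43803eb57fd14d4be2043318",
    "test.tar.gz": "1701920952c900196d337e2d1d71360b",
    "train.parquet": "2794b4635c3a1a76b9e49c9f494f898b",
    "train.tar.gz": "5638701fd044f9512109386f74b43e1e",
    "validation.parquet": "dff7caf23c3ee0c3ac47278ae42b2957",
    "validation.tar.gz": "500fcd40223914ef838273bbaa2cf730",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each listed file found in `directory` to True if its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of_file(path) == expected
    return results
```

Reading in chunks keeps memory use flat, which matters for the 61.5 GB `test.hdf5` file. Files that have not been downloaded are simply absent from the returned dictionary.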

Additional details

Fundação para a Ciência e Tecnologia
PhD Scholarship 2022.11905.BD

Repository URL: https://github.com/lvilaca16/dialogue-av
Programming language: Python
Development Status: Active
