Multimodal Vision-Audio-Language Dataset

Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin

doi:10.5281/zenodo.10060785

Published October 31, 2023 | Version 1.0.0

Dataset Restricted

1. Goethe University Frankfurt
2. The Hessian Center for Artificial Intelligence

The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and textual descriptions of the visual and auditory content. The dataset combines existing datasets and fills in their missing modalities.

Details can be found in the attached report.

Annotation

The annotation files are provided as Parquet files. They can be read in Python using the pandas and pyarrow libraries.

The split into train, validation and test sets follows the splits of the original datasets.

Installation

pip install pandas pyarrow

Example

import pandas as pd
# Load the training annotations and print the first record
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])

dataset AudioSet
filename train/---2_BBVHAA.mp3
captions_visual [a man in a black hat and glasses.]
captions_auditory [a man speaks and dishes clank.]
tags [Speech]
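
Since every record carries a dataset field, a quick way to inspect the ensemble is to count records per source. The following is a minimal sketch using hypothetical rows that only mimic the annotation schema; the row values and the "OtherSource" name are assumptions for illustration, and in practice df would come from the read_parquet call above.

```python
import pandas as pd

# Hypothetical rows that mimic the annotation schema (values are made up).
# In practice: df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
df = pd.DataFrame(
    {
        "dataset": ["AudioSet", "AudioSet", "OtherSource"],
        "filename": ["train/a.mp3", "train/b.mp3", "train/c.mp4"],
        "captions_visual": [["a man in a hat."], None, ["a dog runs."]],
        "captions_auditory": [["a man speaks."], ["music plays."], ["a dog barks."]],
        "tags": [["Speech"], ["Music"], None],
    }
)

# Number of records contributed by each source dataset.
counts = df["dataset"].value_counts()

# Restrict the annotations to a single source dataset.
audioset = df[df["dataset"] == "AudioSet"]

print(counts.to_dict())
print(len(audioset))
```

The same two lines (value_counts and the boolean filter) should apply unchanged to the real annotation table.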

Description

The annotation file consists of the following fields:

filename: Name of the corresponding file (video or audio file).
dataset: Source dataset associated with the data point.
captions_visual: A list of captions describing the visual content of the video. NaN if the file has no visual content.
captions_auditory: A list of captions describing the auditory content of the video.
tags: A list of tags classifying the sound of the file. NaN if no tags are provided.
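
Because captions_visual and tags can be NaN, code that iterates over these fields may want to normalize missing values first. The sketch below shows one possible approach; the helper name as_list and the sample record are hypothetical and not part of the dataset.

```python
import math


def as_list(value):
    """Return a field value as a plain list; missing values become []."""
    if value is None:
        return []
    # NaN is a float, so it must be checked before trying to iterate.
    if isinstance(value, float) and math.isnan(value):
        return []
    return list(value)


# Hypothetical record mirroring the fields described above.
record = {
    "filename": "train/example.mp3",
    "dataset": "AudioSet",
    "captions_visual": float("nan"),  # assumed: audio-only file, no visual content
    "captions_auditory": ["a man speaks and dishes clank."],
    "tags": ["Speech"],
}

# Collect every available caption without special-casing missing fields.
all_captions = as_list(record["captions_visual"]) + as_list(record["captions_auditory"])
print(all_captions)
```

Treating None the same as NaN is a deliberate choice here, since the way missing list values are represented can depend on how the Parquet file is read.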

Data files

The raw data files for most datasets are not released due to licensing issues and must be downloaded from the original sources. However, since some files may be missing from those sources, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
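
After downloading the raw data from the sources, it can help to check which annotated files are still missing locally before requesting them. The following is a minimal sketch; it assumes the files are stored under a local root directory using the same relative paths as the filename field, and the helper name find_missing is hypothetical.

```python
import tempfile
from pathlib import Path


def find_missing(filenames, root):
    """Return the annotation filenames that have no file under `root`."""
    root = Path(root)
    return [name for name in filenames if not (root / name).is_file()]


# Demonstration with a temporary directory standing in for the download folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train").mkdir()
    (Path(tmp) / "train" / "present.mp3").touch()
    missing = find_missing(["train/present.mp3", "train/absent.mp3"], tmp)

print(missing)
```

With the real annotations, the filenames argument could be the filename column of the loaded table.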

Files

Restricted

The record is publicly accessible, but files are restricted.

