Spoken Gigaword

doi:10.34777/ghta-f269

Published December 7, 2023 | Version v1

Dataset Open

1. Idiap Research Institute

Description

This is the synthetic Spoken Gigaword dataset, one part of the data created for the study of interpreter-aided spoken language understanding (SLU) in the paper cited below. That data comprises three parts:

SLURP-Fr, an end-to-end SLU dataset based on the French portion of MASSIVE (https://github.com/alexa/massive), containing 16,521 synthetic audio samples created using Google TTS, accompanied by 477 real test samples collected from two French speakers at Idiap.
SLURP-Es, a similar dataset based on the parallel Spanish portion of MASSIVE, containing only synthetic samples.
Spoken Gigaword, a speech summarization dataset generated from Gigaword (https://www.tensorflow.org/datasets/catalog/gigaword), containing 51,385 synthetic audio samples created using Google TTS.

Reference

If you use this dataset, please cite the following publication:

He, Mutian, and Philip N. Garner. "The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation." Findings of EMNLP 2023.

Files

Files (8.6 GB)

Name	MD5	Size
Spoken_Gigaword-documentation.tar.gz	eeae901af3ca6983125df32b8decac06	4.5 MB
Spoken_Gigaword-en.tar.gz	690b4acfa43cf7f42d13bd39d9802d79	8.6 GB
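
Since the larger archive is 8.6 GB, it may be worth checking a download against the MD5 digests listed above before unpacking it. The following is a minimal sketch of one way to do that, not an official tool for this dataset. It assumes both archives were saved under their listed names in a single directory; the function names `md5_of_file` and `verify_downloads` are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing for this record.
EXPECTED_MD5 = {
    "Spoken_Gigaword-documentation.tar.gz": "eeae901af3ca6983125df32b8decac06",
    "Spoken_Gigaword-en.tar.gz": "690b4acfa43cf7f42d13bd39d9802d79",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected archive name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the archives were downloaded into the current directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in 1 MiB chunks keeps memory use flat regardless of archive size. A file that fails the check is reported the same way as a missing one, so a failed result is a cue to re-download rather than a diagnosis.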

Additional details

Is described by: Conference paper: 10.48550/arXiv.2305.09652 (DOI)

Funding: Swiss National Science Foundation, grant 197479, "Storytelling and first impressions in face-to-face and algorithm-powered digital interviews"
