ParlamentParla - Speech corpus of Catalan Parliamentary sessions

Külebi, Baybars

doi:10.5281/zenodo.5541827

Published October 5, 2021 | Version v2.0

Dataset Open

Külebi, Baybars¹

1. Col·lectivaT SCCL

This is the ParlamentParla speech corpus for Catalan, prepared by Col·lectivaT. The audio segments were extracted from recordings of the Catalan Parliament (Parlament de Catalunya) plenary sessions, which took place between 2007/07/11 and 2018/07/17. We aligned the transcriptions with the recordings and extracted the corpus. The content belongs to the Catalan Parliament, and the data is released in accordance with their terms of use.

Preparation of this corpus was partly supported by the Department of Culture of the Catalan autonomous government, and v2.0 was supported by the Barcelona Supercomputing Center, within the framework of the AINA project of the Departament de Polítiques Digitals.

As of v2.0 the corpus is separated into 211 hours of clean and 400 hours of other quality segments. Furthermore, each speech segment is tagged with its speaker and each speaker with their gender. The statistics are detailed in the readme file.

For more information, go to https://github.com/CollectivaT-dev/ParlamentParla or email info@collectivat.cat.

Revision log:

2.0: Major changes in the file structure; speaker IDs with respective genders added. The speakers of the train, test and dev corpora do not overlap. A major increase in size, with a total time of 611 hours 43 minutes.
1.0: Much better quality due to improved segmentation; corpus separated into clean and other.
0.2: First public release of approx. 320 hours.

Files (52.5 GB)

Name	MD5	Size
clean_dev.tar.gz	5147dc5ff70601f7ce231d495a4334b2	457.7 MB
clean_test.tar.gz	ea4dcfa2232232a2aad85fca0bc35acc	456.1 MB
clean_train.tar.gz	ef1d850ad7fa164bc935c38478c77691	17.3 GB
other_dev.tar.gz	6f4a011df17e682c883b6e0838cae86d	459.6 MB
other_test.tar.gz	2947d7f69d477e2c1b7e778e0a8338ab	440.1 MB
other_train.tar.gz	8013497f3e314508d6ffeca64b084a01	33.4 GB
README.md	b2c34b95aab19d4e8fd4ffc559d6ba8d	2.6 kB
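
Because the archives are large, it may be worth checking each download against the MD5 digests in the file listing above before unpacking. The following is a minimal sketch in Python using only the standard library. The function names `md5_of_file` and `verify_downloads` are hypothetical helpers introduced here for illustration, and the script assumes the files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing above.
EXPECTED_MD5 = {
    "clean_dev.tar.gz": "5147dc5ff70601f7ce231d495a4334b2",
    "clean_test.tar.gz": "ea4dcfa2232232a2aad85fca0bc35acc",
    "clean_train.tar.gz": "ef1d850ad7fa164bc935c38478c77691",
    "other_dev.tar.gz": "6f4a011df17e682c883b6e0838cae86d",
    "other_test.tar.gz": "2947d7f69d477e2c1b7e778e0a8338ab",
    "other_train.tar.gz": "8013497f3e314508d6ffeca64b084a01",
    "README.md": "b2c34b95aab19d4e8fd4ffc559d6ba8d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each listed file name to True (digest matches), False (mismatch)
    or None (file not present in the directory)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == checksum) if path.is_file() else None
    return results


if __name__ == "__main__":
    # Assumption: the archives were downloaded into the current directory.
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, outcome in verify_downloads(".").items():
        print(f"{name}: {labels[outcome]}")
```

A mismatch usually points to an interrupted or corrupted download, in which case fetching the file again is the simplest remedy.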

	All versions	This version
Views	1,464	1,459
Downloads	1,233	1,228
Data volume	227.7 TB	227.6 TB
