Published May 9, 2023 | Version 1.0.0
Dataset Restricted

Nos_ParlaSpeech-GL: Galician ASR corpus

Description

This corpus is publicly accessible upon accepting T&Cs and requesting access.

Nos_ParlaSpeech-GL is an ASR corpus of more than 1,600 hours of automatically aligned speech and text, created from audio and official transcripts of Galician parliamentary sessions held between 2015 and 2022. The content belongs to the Galician Parliament, and the data is released according to its terms of use.

The corpus is split into two subcorpora, "clean" and "other". The segments included in the "clean" subcorpus were filtered according to several alignment quality criteria, whereas the "other" subcorpus comprises the segments that were discarded in the filtering process. The details of both subcorpora can be found in the table below:

Subcorpus   No. of hours   No. of segments
Clean       1,196.92       667,308
Other       477.71         130,332
Total       1,674.63       797,640

 

Moreover, each speech segment is tagged with the ID of its corresponding speaker. Metadata of the different speakers, compiled within the ParlaMint-GL project, can be accessed here.

Each audio file is named with an ID made up of three parts separated by underscores: a four-letter uppercase code denoting the source of the data (Minutes of the Galician Parliament), a 3-digit session number, and an 8-digit date in DDMMYYYY format (e.g., DSPG_095_27012015.wav).

For the transcription files, this ID is prefixed with the name of the subcorpus to which the file belongs, "clean" or "other", followed by an underscore (e.g., clean_DSPG_095_27012015.stm, other_DSPG_095_27012015.stm).
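The naming scheme above can be checked mechanically. The following is a minimal sketch, not part of the corpus tooling: the function name parse_corpus_filename, the CorpusFile type, and the regular expression are hypothetical, and the sketch assumes that JSON transcription files follow the same naming as STM files, which the description does not state explicitly.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Pattern derived from the naming scheme described above (an assumption,
# not an official specification): optional "clean_"/"other_" prefix,
# four capital letters, 3-digit session number, 8-digit DDMMYYYY date.
_NAME_RE = re.compile(
    r"^(?:(?P<subcorpus>clean|other)_)?"
    r"(?P<source>[A-Z]{4})_(?P<session>\d{3})_(?P<date>\d{8})"
    r"\.(?P<ext>wav|stm|json)$"
)


class CorpusFile(NamedTuple):
    subcorpus: Optional[str]  # None for audio files, which carry no prefix
    source: str               # four-letter source code, e.g. "DSPG"
    session: int              # session number
    session_date: date        # date decoded from DDMMYYYY
    extension: str            # "wav", "stm" or "json"


def parse_corpus_filename(name: str) -> CorpusFile:
    """Split a corpus file name into its components.

    Raises ValueError if the name does not match the assumed pattern
    or if the 8-digit field is not a valid calendar date.
    """
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    raw = match.group("date")
    day, month, year = int(raw[:2]), int(raw[2:4]), int(raw[4:])
    return CorpusFile(
        subcorpus=match.group("subcorpus"),
        source=match.group("source"),
        session=int(match.group("session")),
        session_date=date(year, month, day),
        extension=match.group("ext"),
    )
```

For example, parse_corpus_filename("clean_DSPG_095_27012015.stm") yields the subcorpus "clean", session 95, and the date 27 January 2015. The same function applied to DSPG_095_27012015.wav returns the same session and date with no subcorpus, which makes it easy to pair each audio file with its two transcription files.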

The corpus is available in STM and JSON formats, and the audio files are released in 16 kHz 16-bit WAV format.
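After downloading, the stated audio format can be verified with Python's standard wave module. This is a minimal sketch under stated assumptions: the names WavInfo, read_wav_info, and matches_release_format are hypothetical, and the channel count is only reported, not checked, because the description does not specify it.

```python
import wave
from typing import NamedTuple


class WavInfo(NamedTuple):
    sample_rate: int      # samples per second
    bits_per_sample: int  # sample width in bits
    channels: int         # reported only; not specified in the description
    duration_s: float     # length of the recording in seconds


def read_wav_info(path: str) -> WavInfo:
    """Read basic header information from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return WavInfo(
            sample_rate=rate,
            bits_per_sample=wav.getsampwidth() * 8,  # bytes to bits
            channels=wav.getnchannels(),
            duration_s=wav.getnframes() / rate,
        )


def matches_release_format(info: WavInfo) -> bool:
    """True if the file is 16 kHz, 16-bit, as stated for this corpus."""
    return info.sample_rate == 16000 and info.bits_per_sample == 16
```

Running read_wav_info over every downloaded file and summing duration_s also gives a quick sanity check of the audio against the session recordings, bearing in mind that the hour counts in the table refer to aligned segments rather than to full files.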

Hugging Face version

The corpus is also available on Hugging Face.

Disclaimer:

We are not responsible for any inconsistencies in speaker identification that stem from misidentification in the original transcripts.

Funding and acknowledgements:

This corpus was compiled in collaboration with VICOMTECH.

"The Nós project: Galician in the society and economy of Artificial Intelligence" is possible thanks to the funding resulting from the agreement 2021-CP080 between the Xunta de Galicia and the University of Santiago de Compostela, and thanks to the Investigo program, within the National Recovery, Transformation and Resilience Plan, within the framework of the European Recovery Fund (NextGenerationEU).

We would like to thank the Galician Parliament for their kind collaboration in providing the original data.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.

Terms and Conditions

By accessing and using this dataset, you agree to comply with all applicable laws and ethical standards regarding the protection of individual rights. The dataset contains voice files, transcripts, and metadata, including participant identity information, provided solely for research and development purposes. Users are strictly prohibited from using the dataset in any way that infringes upon the rights, privacy, or dignity of any individual represented within it. Any misuse, including but not limited to attempts to engage in discriminatory, harmful, or unlawful activities, is expressly forbidden.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

This dataset is publicly accessible, but you have to accept the Terms and Conditions to access its files and content.

This document is a legal contract between you and the dataset provider. By accessing and using the dataset, you acknowledge and agree to abide by the Terms and Conditions.
 
Our team may take a few days to process your access request. Thank you in advance for your patience.
