Nos_TranscriSpeech-GL: Galician ASR corpus

Vladu, Adina Ioana; Vázquez Abuín, Marta; Fernández Rei, Elisa; García Díaz, Noelia; Vidal Miguéns, Adrián; Magariños, Carmen

doi:10.5281/zenodo.7717140

Published March 10, 2023 | Version 1.0.0

Dataset Restricted

1. Universidade de Santiago de Compostela

This corpus is publicly accessible once you accept the Terms and Conditions and request access.

Nos_TranscriSpeech-GL is a manually transcribed and speech-to-text aligned Galician ASR corpus containing 53 hours of multi-domain speech.

The corpus is divided into four subcorpora according to the type of audio: conferences, debates, speeches, and interviews.

Audio and transcription files share one naming scheme: a speaker ID, followed where necessary by a number that distinguishes successive recordings by the same speaker. When a recording is split into parts, the part number follows the ID, separated by an underscore (e.g., Alberti1_1.wav).
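The naming scheme described above can be split into its components with a small parser. This is a minimal sketch, not part of the corpus tooling: the names `parse_name` and `CorpusName` are hypothetical, and the pattern assumes speaker IDs consist of letters only, which the record does not state.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern: a letters-only speaker ID, an optional recording number,
# an optional "_<part>" suffix, then one of the three released extensions.
_NAME = re.compile(
    r"(?P<speaker>[^\W\d_]+)"   # speaker ID (Unicode letters; an assumption)
    r"(?P<recording>\d+)?"      # successive recording by the same speaker
    r"(?:_(?P<part>\d+))?"      # part of the same recording
    r"\.(?P<ext>wav|stm|trs)"   # audio or transcription file
)


@dataclass(frozen=True)
class CorpusName:
    speaker: str
    recording: Optional[int]
    part: Optional[int]
    ext: str


def _to_int(value: Optional[str]) -> Optional[int]:
    """Convert a captured digit string to int, keeping None for absent groups."""
    return int(value) if value is not None else None


def parse_name(filename: str) -> CorpusName:
    """Split a corpus file name into speaker, recording, part and extension."""
    match = _NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return CorpusName(
        speaker=match["speaker"],
        recording=_to_int(match["recording"]),
        part=_to_int(match["part"]),
        ext=match["ext"],
    )
```

For instance, `parse_name("Alberti1_1.wav")` yields speaker `Alberti`, recording `1`, part `1`. If the actual speaker IDs contain digits or underscores, the pattern would need adjusting against the speaker metadata.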

The audio files are released in 44.1 kHz, 16-bit WAV format, and the transcriptions are available in .stm and .trs formats. The corpus also includes the corresponding speaker metadata and the guide detailing the conventions used for manual transcription.
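After downloading, it can be useful to confirm that a file matches the stated audio format before feeding it to an ASR pipeline. The following is a minimal sketch using Python's standard `wave` module; the function name `matches_release_format` is hypothetical, and it checks only the sample rate and bit depth given above, since the record does not state the channel count.

```python
import wave

EXPECTED_RATE_HZ = 44_100         # 44.1 kHz, as stated in the description
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit samples


def matches_release_format(path: str) -> bool:
    """Return True if the WAV file at `path` is 44.1 kHz with 16-bit samples."""
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )
```

Many ASR toolkits expect 16 kHz input, so a check like this is a natural place to decide whether resampling is needed.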

The dataset includes detailed linguistic annotation. Transcriptions were carried out following the included guideline (available on Zenodo), with conventions such as: orthographic transcription in standard Galician, marking of dialectal, non-standard and non-Galician forms, annotation of paralinguistic and oral features (e.g. laughter, pauses, overlaps, interruptions, singing), explicit marking of speaker turns and audience interventions, and standardized representation of numerals and acronyms.

Funding and acknowledgements

“The Nós project: Galician in the society and economy of Artificial Intelligence” is possible thanks to the funding resulting from the agreement 2021-CP080 between the Xunta de Galicia and the University of Santiago de Compostela, and thanks to the Investigo program, within the National Recovery, Transformation and Resilience Plan, within the framework of the European Recovery Fund (NextGenerationEU).

We would like to thank the Corpus Oral Informatizado da Lingua Galega (CORILGA) project for their kind collaboration in providing the original data.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.

Terms and Conditions

By accessing and using this dataset, you agree to comply with all applicable laws and ethical standards regarding the protection of individual rights. The dataset contains voice files, transcripts, and metadata, including participant identity information, provided solely for research and development purposes. Users are strictly prohibited from using the dataset in any way that infringes upon the rights, privacy, or dignity of any individual represented within it. Any misuse, including but not limited to attempts to engage in discriminatory, harmful, or unlawful activities, is expressly forbidden.

This document is a legal contract between you and the dataset provider. By accessing and using the dataset, you acknowledge and agree to abide by the Terms and Conditions.

Our team may take a few days to process your access request. Thank you in advance for your patience.
