Nos_Brais-GL
Description
This dataset is publicly accessible upon accepting the terms and conditions and requesting access.
Galician TTS single-speaker corpus of approximately 18 hours of speech.
Nos_Brais-GL is based on a phonetically and morphosyntactically rich text corpus of 16,121 phrases (approximately 168,000 words) comprising three subcorpora: selected phrases from a corpus compiled by the Nós Project from multi-domain texts and previously used in the Nos_Celtia-GL TTS corpus; selected phrases from a previously compiled corpus created by the Grupo de Tecnoloxías Multimedia (GTM) and the Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH); and, finally, a 500-word phonetically rich single-word subcorpus extracted from the Dicionario de pronuncia da lingua galega.
Nos_Brais-GL was recorded in a controlled environment (recording studio) by a professional male voice talent, selected from among three speakers through a perceptual listening test in which 37 participants assessed the speakers' clarity, prosody, likeability, and language proficiency.
Audio files are provided in three versions:
- raw sound files with no editing or normalization;
- edited audio;
- edited and normalized audio.
The names of the audio files consist of a series of lowercase elements indicating the type of audio (raw/edit/norm), the creators of the dataset (nos), the name of the voice (brais), and the ISO code for the Galician language (gl), followed by a 5-digit number identifying the utterance. All components are separated by underscores (e.g., norm_nos_brais_gl_00001.wav).
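The naming scheme can be checked or generated programmatically. The following is a minimal sketch, not part of the dataset distribution: the names `FILENAME_RE`, `UtteranceFile`, `parse_filename`, and `build_filename` are hypothetical, and the pattern assumes that every file follows the scheme exactly as described above.

```python
import re
from typing import NamedTuple

# Pattern for <type>_nos_brais_gl_<5 digits>.wav, where <type> is raw, edit, or norm.
# Assumption: no other prefixes or suffixes occur in the corpus.
FILENAME_RE = re.compile(r"(raw|edit|norm)_nos_brais_gl_(\d{5})\.wav")


class UtteranceFile(NamedTuple):
    audio_type: str    # "raw", "edit", or "norm"
    utterance_id: int  # numeric value of the 5-digit identifier


def parse_filename(name: str) -> UtteranceFile:
    """Split a corpus file name into its audio type and utterance number."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a Nos_Brais-GL audio file name: {name!r}")
    return UtteranceFile(match.group(1), int(match.group(2)))


def build_filename(audio_type: str, utterance_id: int) -> str:
    """Compose a file name; the result is validated by parsing it back."""
    name = f"{audio_type}_nos_brais_gl_{utterance_id:05d}.wav"
    parse_filename(name)  # raises ValueError for unknown types or out-of-range ids
    return name


print(parse_filename("norm_nos_brais_gl_00001.wav"))
print(build_filename("edit", 42))
```

Parsing the name back inside `build_filename` keeps a single source of truth for the scheme, so the regular expression is the only place that needs to change if the assumption above turns out to be too strict.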
Metadata are provided in three files, one per audio version: "raw_nos_brais_gl_text.csv", "edit_nos_brais_gl_text.csv", and "norm_nos_brais_gl_text.csv". Each file contains one record per line, with fields delimited by the vertical bar character (0x7c). The fields are:
1. Audio file: name of the corresponding .wav file
2. Transcription: normalized text read by speaker (UTF-8)
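A metadata file in this layout can be read with Python's standard `csv` module. The sketch below is illustrative only: `read_metadata` is a hypothetical helper, and it assumes that the files have no header row, that every record has exactly two fields, and that quotation marks in the transcriptions are literal text rather than CSV quoting.

```python
import csv


def read_metadata(path):
    """Return a list of (audio_file, transcription) pairs from a metadata file."""
    records = []
    # newline="" is the setting recommended by the csv module documentation.
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quotation marks as ordinary characters (assumption).
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2:
                raise ValueError(
                    f"{path}, line {line_no}: expected 2 fields, got {len(row)}"
                )
            records.append((row[0], row[1]))
    return records
```

For example, `read_metadata("norm_nos_brais_gl_text.csv")` would return the pairs for the edited and normalized audio. Raising on an unexpected field count, rather than silently truncating, makes it obvious if a transcription happens to contain the delimiter.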
The audio files are provided in their original recording format (48 kHz, 24-bit WAV) and amount to approximately 18 hours.
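The stated audio format can be verified after download with Python's standard `wave` module, which handles uncompressed PCM WAV files only. This is a sketch under that assumption; `describe_wav` and `matches_corpus_format` are hypothetical helpers, and the channel count is not checked because it is not stated in this description.

```python
import wave

EXPECTED_SAMPLE_RATE_HZ = 48_000  # from the dataset description
EXPECTED_BIT_DEPTH = 24           # from the dataset description


def describe_wav(path):
    """Read sample rate, bit depth, and duration from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "duration_s": frames / rate,
        }


def matches_corpus_format(info):
    """Check a describe_wav() result against the 48 kHz, 24-bit format."""
    return (
        info["sample_rate_hz"] == EXPECTED_SAMPLE_RATE_HZ
        and info["bit_depth"] == EXPECTED_BIT_DEPTH
    )
```

Summing `duration_s` over all files of one audio version should give a total close to the approximately 18 hours stated above.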
Version 1.0.0 contains 16,121 audio files, together with the corresponding text.
For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.
Funding and acknowledgements
This dataset was produced within the framework of the Proxecto Nós, funded by the Ministry for Digital Transformation and Public Administration and the Recovery, Transformation, and Resilience Plan – Funded by the European Union – NextGenerationEU, as part of the Ilenia Project with reference 2022/TL22/00215336.
We would like to deeply thank the speaker, Gaspar González Somoza, for kindly providing his voice to this project.
The team also thanks the creators of the CorpusCrt tool (Universitat Politècnica de Catalunya, http://www.talp.upc.es).
We would also like to thank the following entities for their kind collaboration in providing the data for the text corpus: Grupo de Tecnoloxías Multimedia (GTM-UVigo), Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH), Real Academia Galega, Corporación Radio Televisión de Galicia S.A., Parlamento de Galicia, the Arquivo do Galego Oral (AGO-ILG) project, and the Dicionario de pronuncia da lingua galega.