Nos_Brais-GL: Galician TTS corpus

Vladu, Adina Ioana; García Díaz, Noelia; Regueira Fernández, Xosé Luís; Magariños, Carmen; Moscoso Sánchez, Antonio; Fernández López, Daniel; Fernández Rei, Elisa; Dubert-García, Francisco

doi:10.5281/zenodo.14265241

Published July 15, 2025 | Version v1

Dataset Restricted

1. Instituto da Lingua Galega
2. Universidade de Santiago de Compostela

This dataset is publicly accessible upon accepting T&Cs and requesting access.

Galician TTS single-speaker corpus of approximately 18 hours of speech.

Nos_Brais-GL is based on a phonetically and morphosyntactically rich text corpus of 16,121 phrases (approximately 168,000 words) comprising three subcorpora:
- selected phrases from a corpus compiled by the Nós Project from multi-domain texts and previously used in the Nos_Celtia-GL TTS corpus;
- selected phrases from a previously compiled corpus created by the Grupo de Tecnoloxías Multimedia (GTM) and the Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH);
- a 500-word phonetically rich single-word subcorpus extracted from the Dicionario de pronuncia da lingua galega.

Nos_Brais-GL was recorded in a controlled environment (recording studio) by a professional male voice talent. He was selected from among three speakers through a perceptual listening test in which 37 participants assessed the speakers' clarity, prosody, likeability, and language proficiency.

Audio files are provided in three versions:
- raw sound files with no editing or normalization;
- edited audio;
- edited and normalized audio.

The names of the audio files consist of a series of lowercase elements indicating the type of audio (raw/edit/norm), the creators of the dataset (nos), the name of the voice (brais), and the ISO code for the Galician language (gl), followed by a 5-digit number identifying the utterance. All components are separated by underscores (e.g., norm_nos_brais_gl_00001.wav).
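The naming scheme above can be handled programmatically. The following is a minimal Python sketch, not part of the dataset: the function names parse_name and build_name are hypothetical, and the pattern is inferred only from the description and the single example given above.

```python
import re

# Pattern inferred from the naming scheme described above:
# <raw|edit|norm>_nos_brais_gl_<5 digits>.wav
NAME_RE = re.compile(r"(raw|edit|norm)_nos_brais_gl_(\d{5})\.wav")

VERSIONS = ("raw", "edit", "norm")


def parse_name(filename):
    """Return (audio version, utterance number) for a dataset file name."""
    match = NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a Nos_Brais-GL audio file name: {filename!r}")
    return match.group(1), int(match.group(2))


def build_name(version, utterance):
    """Build a file name from an audio version and an utterance number."""
    if version not in VERSIONS:
        raise ValueError(f"unknown audio version: {version!r}")
    if not 0 <= utterance <= 99999:
        raise ValueError("utterance number must fit in 5 digits")
    return f"{version}_nos_brais_gl_{utterance:05d}.wav"


version, utterance = parse_name("norm_nos_brais_gl_00001.wav")
raw_name = build_name("raw", utterance)
```

Because the utterance number is shared across the three versions, a helper of this kind can map a normalized file to its raw or edited counterpart.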

Metadata are provided in "raw/edit/norm_nos_brais_gl_text.csv". Each of these files contains one record per line, with fields delimited by the vertical bar character ("|", 0x7c). The fields are:
1. Audio file: name of the corresponding .wav file
2. Transcription: normalized text read by speaker (UTF-8)
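A metadata file in this layout can be read with a few lines of Python. The sketch below is illustrative only: the function name read_metadata is hypothetical, it assumes the files have no header row and no quoting (neither is stated above), and the sample transcription is an invented placeholder, not text from the corpus.

```python
import io


def read_metadata(handle):
    """Read 'file name|transcription' records from an open text file."""
    records = []
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first vertical bar only, so that any later bar
        # stays inside the transcription field.
        name, separator, text = line.partition("|")
        if not separator:
            raise ValueError(f"line {line_no}: no '|' delimiter found")
        records.append((name, text))
    return records


# Invented sample record, used here only to show the layout.
sample = io.StringIO("norm_nos_brais_gl_00001.wav|Texto de exemplo.\n")
records = read_metadata(sample)
```

With a real file, the handle would typically come from open(path, encoding="utf-8"), since the transcriptions are UTF-8 text.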

The audio files are provided in their original recording format (48 kHz, 24-bit WAV) and amount to approximately 18 hours.
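After download, the stated format can be checked with Python's standard wave module. This is a minimal sketch under assumptions: the function name check_format is hypothetical, the channel count is not stated above (the synthetic test file here is mono), and the example runs on one second of generated silence, not on corpus audio.

```python
import io
import wave


def check_format(source, rate=48000, sample_width_bytes=3):
    """Return (matches expected format, duration in seconds) for a WAV file.

    A 24-bit sample occupies 3 bytes, hence sample_width_bytes=3.
    """
    with wave.open(source, "rb") as wav:
        matches = (
            wav.getframerate() == rate
            and wav.getsampwidth() == sample_width_bytes
        )
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration


# Build one second of 48 kHz, 24-bit mono silence in memory for the demo.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(3)
    out.setframerate(48000)
    out.writeframes(b"\x00\x00\x00" * 48000)
buffer.seek(0)

matches, seconds = check_format(buffer)
```

Summing the durations over all files of one version would give a figure to compare against the approximate 18 hours reported. Note that the wave module may reject files written with an extensible WAV header on some Python versions, in which case a third-party audio library would be needed.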

Version 1.0.0 contains 16,121 audio files, together with the corresponding text.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.

Funding and acknowledgements

This dataset was produced within the framework of the Proxecto Nós, funded by the Ministry for Digital Transformation and Public Administration and the Recovery, Transformation, and Resilience Plan – Funded by the European Union – NextGenerationEU, as part of the Ilenia Project with reference 2022/TL22/00215336.

We would like to deeply thank the speaker, Gaspar González Somoza, for kindly providing his voice to this project.

The team also thanks the creators of the CorpusCrt tool (Universidad Politécnica de Catalunya, http://www.talp.upc.es).

We would also like to thank the following entities for their kind collaboration in providing the data for the text corpus: Grupo de Tecnoloxías Multimedia (GTM-UVigo), Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH), Real Academia Galega, Corporación Radio Televisión de Galicia S.A., Parlamento de Galicia, the Arquivo do Galego Oral (AGO-ILG) project, and the Dicionario de pronuncia da lingua galega.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

By clicking the "Request access" button, I agree to the following terms and conditions:

Agreement
These Terms and Conditions govern the use of the Galician TTS dataset Nos_Brais-GL developed within the Nós Project of the University of Santiago de Compostela ("Holder"). By downloading or using the dataset, the User agrees to these terms.

License
The dataset is made available under the Creative Commons Attribution 4.0 International license. Users may use, modify, and distribute it under the terms of that license, with proper acknowledgement of the Holder.

Permitted Use
The dataset may be used for research and development of speech technologies in Galician, including commercial use. Users must comply with applicable laws and respect third-party rights.

Restrictions
The dataset must not be used directly or indirectly for illegal activities, disinformation, hate speech, or to infringe privacy or intellectual property rights.

Disclaimer
The dataset is provided "as is," without warranties. The Holder accepts no liability for damages or errors resulting from its use.

Amendments
The Holder may update these Terms at any time. The latest version will be available in the repository where the dataset is hosted.
