Nos_Celtia-GL: Galician TTS corpus

Vázquez Abuín, Marta; García Díaz, Noelia; Vladu, Adina Ioana; Magariños, Carmen; Vidal Miguéns, Adrián; Fernández Rei, Elisa

doi:10.5281/zenodo.7716958

Published March 14, 2023 | Version 1.0.0

Dataset Restricted


1. Universidade de Santiago de Compostela

This corpus is publicly accessible after requesting access and accepting the terms and conditions.

Galician TTS single speaker corpus of approximately 25 hours of speech.

Nos_Celtia-GL is a phonetically and morphosyntactically rich corpus of 20,000 phrases (approximately 200,000 words) comprising two subcorpora: a previously compiled corpus created by the Grupo de Tecnoloxías Multimedia (GTM), together with the Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH), and a corpus compiled by the Nós Project from multi-domain texts.

The text corpus statistics are detailed in the table below:

Subcorpus	Sentences	Words	Sentence length (words)	Sentence domain / type
GTM	10,000	121,726	1-44	Journalistic (written) text; manually designed sentences (interrogative, exclamative, imperative, lists of numbers…)
Nós	10,000	99,622	1-36	21.8% transcripts of oral discourse; 17.5% dictionary definitions; 12.7% transcripts of parliamentary speeches; 20% transcripts of news broadcasts; 28% short (<4 words), interrogative, exclamative, imperative, and elliptical sentences
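As a quick sanity check on the table, the following sketch derives the corpus totals and the mean sentence length per subcorpus, and confirms that the Nós domain shares add up to 100%. The figures are copied from the table; the variable and function names are illustrative only.

```python
# Figures copied from the table above: (sentences, words) per subcorpus.
SUBCORPORA = {"GTM": (10_000, 121_726), "Nós": (10_000, 99_622)}

# Domain shares (in percent) listed for the Nós subcorpus.
NOS_DOMAIN_SHARES = [21.8, 17.5, 12.7, 20.0, 28.0]


def mean_sentence_length(sentences: int, words: int) -> float:
    """Average number of words per sentence."""
    return words / sentences


def totals(subcorpora: dict) -> tuple:
    """Return (total sentences, total words) over all subcorpora."""
    total_sentences = sum(s for s, _ in subcorpora.values())
    total_words = sum(w for _, w in subcorpora.values())
    return total_sentences, total_words


if __name__ == "__main__":
    for name, (sentences, words) in SUBCORPORA.items():
        print(name, round(mean_sentence_length(sentences, words), 2))
    print(totals(SUBCORPORA))
    print(round(sum(NOS_DOMAIN_SHARES), 1))
```

The totals come to 20,000 sentences and 221,348 words, with a mean of about 12.2 words per sentence for GTM and about 10.0 for Nós.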

While the Nós subcorpus has undergone a thorough linguistic review, we have decided not to adapt the GTM corpus to the current grammatical norms of the Galician language with a view to obtaining a parallel corpus to the previously recorded CRPIH_UVigo-GL-Voices.

Nos_Celtia-GL was recorded in a controlled environment (recording studio) by a professional female voice talent selected among four speakers through a perceptual listening test in which more than 50 participants assessed the speakers' clarity, prosody, likeability, and language proficiency.

The file naming scheme of the audio files consists of a series of lowercase elements indicating the type of audio (raw), the creators of the corpus (nos), the name of the voice (celtia), and the ISO code for the Galician language (gl), followed by a 5-digit number identifying the utterance. All components are separated by underscores (e.g., raw_nos_celtia_gl_00001.wav).
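The naming scheme can be checked or generated programmatically. The following is a minimal sketch: the regular expression is inferred from the description and the single example above, and the function names are hypothetical, not part of the corpus distribution.

```python
import re

# Pattern inferred from the naming scheme described above (an assumption,
# not an official specification): fixed prefix plus a 5-digit utterance number.
FILENAME_RE = re.compile(r"raw_nos_celtia_gl_(\d{5})\.wav")


def utterance_id(filename: str) -> int:
    """Extract the utterance number from a corpus audio file name."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return int(match.group(1))


def build_filename(utt_id: int) -> str:
    """Build the file name for a given utterance number (zero-padded to 5 digits)."""
    if not 0 <= utt_id <= 99_999:
        raise ValueError("utterance number must fit in 5 digits")
    return f"raw_nos_celtia_gl_{utt_id:05d}.wav"
```

For instance, utterance_id("raw_nos_celtia_gl_00001.wav") returns 1, and build_filename(1) gives back the same file name.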

Metadata is provided in "metadata.csv". This file consists of one record per line, delimited by the vertical bar character (0x7c). The fields are:

1. Audio file: name of the corresponding .wav file

2. Transcription: non-normalized text read by speaker (UTF-8)
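A minimal sketch of reading this file in Python follows. It assumes that there is no header row and that only the first vertical bar on each line acts as the field delimiter. Neither point is stated in this record, so both should be verified against the actual file. The function names and the sample sentence in the usage note are made up for illustration.

```python
def parse_metadata_line(line: str) -> tuple[str, str]:
    """Split one record into (audio file name, transcription).

    Only the first "|" is treated as the delimiter, so a transcription
    that happened to contain the character would be kept intact.
    """
    audio_file, sep, transcription = line.rstrip("\r\n").partition("|")
    if not sep:
        raise ValueError(f"no '|' delimiter found in line: {line!r}")
    return audio_file, transcription


def load_metadata(path: str) -> dict[str, str]:
    """Read metadata.csv (UTF-8) into a {audio file: transcription} mapping."""
    with open(path, encoding="utf-8") as handle:
        return dict(parse_metadata_line(line) for line in handle if line.strip())
```

Under these assumptions, a line such as "raw_nos_celtia_gl_00001.wav|Bos días." (an invented example, not taken from the corpus) would be parsed into the pair ("raw_nos_celtia_gl_00001.wav", "Bos días.").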

The audio files are available in the format in which they were originally recorded, 48 kHz, 16-bit WAV format, and amount to approximately 25 hours.
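After download, the stated format can be verified with Python's standard wave module. The sketch below checks the sample rate and bit depth and returns the duration in seconds. The channel count is not checked because this record does not state it, and the function name is illustrative.

```python
import wave

EXPECTED_RATE_HZ = 48_000        # 48 kHz, as stated above
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples


def check_wav(source):
    """Return (format_ok, duration_seconds) for a WAV file.

    `source` is a path string or a binary file object, as accepted by wave.open.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate
    format_ok = rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH_BYTES
    return format_ok, duration
```

Summing the returned durations over all files gives a way to confirm the approximate 25-hour total.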

Version 1.0.0 contains the raw sound files, with no editing or normalization, together with the corresponding text.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.

Terms and conditions

Ownership of the speech data contained in this dataset has been transferred to the University of Santiago de Compostela (USC) for a period of 15 years. Starting 30 November 2037, this data will be removed. After this date, the USC is not liable for any use by third parties who might have downloaded the dataset.

Citing

Please refer to our paper for more details: Nos_Celtia-GL: an Open High-Quality Speech Synthesis Resource for Galician

If you use this data in your work, please cite: García Díaz, N., Vázquez Abuín, M., Magariños, C., Vladu, A.I., Moscoso Sánchez, A., Fernández Rei, E. (2024) Nos_Celtia-GL: an Open High-Quality Speech Synthesis Resource for Galician. Proc. IberSPEECH 2024, 91-95, doi: 10.21437/IberSPEECH.2024-19

Funding and acknowledgements

"The Nós project: Galician in the society and economy of Artificial Intelligence" is possible thanks to the funding resulting from the agreement 2021-CP080 between the Xunta de Galicia and the University of Santiago de Compostela, and thanks to the Investigo program, within the National Recovery, Transformation and Resilience Plan, within the framework of the European Recovery Fund (NextGenerationEU).

We would like to thank the speaker, Consuelo Díaz Isorna, for kindly providing her voice to this project.

We would also like to thank the following entities for their kind collaboration in providing the data for the text corpus: Grupo de Tecnoloxías Multimedia (GTM), Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH), Real Academia Galega, Corporación Radio Televisión de Galicia S.A., Parlamento de Galicia, and the Arquivo do Galego Oral (ILG) project.

We are also grateful to Xoán Carlos Goris García, Elia Lago Pereira, and Alicia López Besteiro for reviewing part of the audio corpus.


