Published March 14, 2023 | Version 1.0.0.
Dataset Embargoed

Nos_Celtia-GL: Galician TTS corpus

Description

Single-speaker Galician TTS corpus comprising approximately 25 hours of speech.

Nos_Celtia-GL is a phonetically and morphosyntactically balanced corpus of 20,000 phrases (approximately 200,000 words) comprising two subcorpora: a previously compiled corpus created by the Grupo de Tecnoloxías Multimedia (GTM), together with the Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH), and a corpus compiled by the Nós Project from multi-domain texts.

The text corpus statistics are detailed in the table below:

Subcorpus   Sentences   Words     Sentence length (words)   Sentence domain / type
GTM         10,000      121,726   1-44
  • Journalistic (written) text
  • Manually designed sentences (interrogative, exclamative, imperative, lists of numbers…)
Nós         10,000      99,622    1-36
  • 21.8% transcripts of oral discourse
  • 17.5% dictionary definitions
  • 12.7% transcripts of parliamentary speeches
  • 20% transcripts of news broadcasts
  • 28% short (<4 words), interrogative, exclamative, imperative, and elliptical sentences
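
As a quick consistency check, the figures in the table can be totalled with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the numbers are copied from the table rather than computed from the corpus files.

```python
# Figures copied from the statistics table (not recomputed from the corpus).
subcorpora = {
    "GTM": {"sentences": 10_000, "words": 121_726},
    "Nós": {"sentences": 10_000, "words": 99_622},
}
# Domain shares (in percent) listed for the Nós subcorpus.
nos_shares = [21.8, 17.5, 12.7, 20.0, 28.0]

total_sentences = sum(s["sentences"] for s in subcorpora.values())
total_words = sum(s["words"] for s in subcorpora.values())
# Mean sentence length in words, per subcorpus.
mean_length = {name: s["words"] / s["sentences"] for name, s in subcorpora.items()}

print(total_sentences)                 # 20000
print(total_words)                     # 221348
print(round(sum(nos_shares), 1))       # 100.0
print(round(mean_length["GTM"], 2))    # 12.17
print(round(mean_length["Nós"], 2))    # 9.96
```

The two subcorpora add up to the 20,000 phrases stated in the description, and the per-subcorpus word counts sum to 221,348.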

The Nós subcorpus has undergone a thorough linguistic review. The GTM subcorpus, by contrast, has deliberately not been adapted to the current grammatical norms of the Galician language, so that it remains parallel to the previously recorded CRPIH_UVigo-GL-Voices.

Nos_Celtia-GL was recorded in a controlled environment (recording studio) by a professional female voice talent. She was selected from among four speakers through a perceptual listening test in which more than 50 participants assessed the speakers' clarity, prosody, likeability, and language proficiency.

Audio file names are built from lowercase elements separated by underscores: the type of audio (raw), the creators of the corpus (nos), the name of the voice (celtia), and the ISO code for the Galician language (gl), followed by a 5-digit number identifying the utterance (e.g., raw_nos_celtia_gl_00001.wav).
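
The naming scheme can be checked or generated programmatically. The following Python sketch is not part of the corpus distribution: the function names and the regular expression are our own, inferred from the description and the single example file name given above.

```python
import re

# Pattern inferred from the documented scheme: four fixed lowercase elements,
# a 5-digit utterance number, and the .wav extension.
FILENAME_RE = re.compile(r"(raw)_(nos)_(celtia)_(gl)_(\d{5})\.wav")

def parse_filename(name):
    """Return the naming components as a dict, or None if the name does not match."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        return None
    audio_type, creators, voice, language, number = match.groups()
    return {
        "audio_type": audio_type,
        "creators": creators,
        "voice": voice,
        "language": language,
        "utterance": int(number),
    }

def build_filename(utterance):
    """Build the file name for a given utterance number (zero-padded to 5 digits)."""
    return f"raw_nos_celtia_gl_{utterance:05d}.wav"

print(parse_filename("raw_nos_celtia_gl_00001.wav")["utterance"])  # 1
print(build_filename(42))  # raw_nos_celtia_gl_00042.wav
```

Names that deviate from the scheme (for example, uppercase elements or a 4-digit number) make parse_filename return None, which is convenient for spotting stray files in a download.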

Metadata is provided in "metadata.csv". This file consists of one record per line, with fields delimited by the vertical bar character (0x7c). The fields are:

  1. Audio file: name of the corresponding .wav file

  2. Transcription: non-normalized text read by speaker (UTF-8)
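
A minimal Python sketch for reading this file is shown below. It rests on assumptions the description does not confirm: that there is no header row, and that any further vertical bar after the first belongs to the transcription. The function name and the two sample records are invented for illustration and are not taken from the corpus.

```python
import io

def read_metadata(lines):
    """Parse '|'-delimited records into (audio_file, transcription) tuples."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first vertical bar only (assumption: two fields per record).
        audio_file, sep, transcription = line.partition("|")
        if not sep:
            raise ValueError(f"missing '|' delimiter: {line!r}")
        records.append((audio_file, transcription))
    return records

# Invented sample standing in for metadata.csv.
sample = io.StringIO(
    "raw_nos_celtia_gl_00001.wav|Bos días.\n"
    "raw_nos_celtia_gl_00002.wav|Que hora é?\n"
)
records = read_metadata(sample)
print(len(records))   # 2
print(records[0][1])  # Bos días.
```

For the real file, the same function can be fed an open file handle, e.g. open("metadata.csv", encoding="utf-8"), since the transcriptions are stated to be UTF-8.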

The audio files are provided in the format in which they were originally recorded (48 kHz, 16-bit WAV) and amount to approximately 25 hours.
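
The stated audio format can be verified with Python's standard wave module. The sketch below is an assumption-laden illustration rather than an official tool: the function and constant names are ours, and the number of channels is only reported, not checked, because the description does not specify it.

```python
import wave

EXPECTED_RATE = 48_000      # 48 kHz, as stated in the description
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16-bit

def describe_wav(source):
    """Report basic properties of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "rate": rate,
        "bits": width * 8,
        "channels": channels,
        "seconds": frames / rate,
        "matches_spec": rate == EXPECTED_RATE and width == EXPECTED_SAMPLE_WIDTH,
    }
```

Calling describe_wav("raw_nos_celtia_gl_00001.wav") on a downloaded file and summing the "seconds" values over all files would give the total duration, which should come out at roughly 25 hours.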

Version 1.0.0 contains the raw sound files, with no editing or normalization, together with the corresponding text.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.

Funding and acknowledgements

"The Nós project: Galician in the society and economy of Artificial Intelligence" is possible thanks to the funding resulting from the agreement 2021-CP080 between the Xunta de Galicia and the University of Santiago de Compostela, and thanks to the Investigo program, within the National Recovery, Transformation and Resilience Plan, within the framework of the European Recovery Fund (NextGenerationEU).

We would like to thank the following entities for their kind collaboration in providing the data for the text corpus: Grupo de Tecnoloxías Multimedia (GTM), Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH), Real Academia Galega, Corporación Radio Televisión de Galicia S.A., Parlamento de Galicia, and the Arquivo do Galego Oral (ILG) project.

We also thank Xoán Carlos Goris García, Elia Lago Pereira, and Alicia López Besteiro for reviewing part of the audio corpus.

Files

Embargoed

The files will be made publicly available on August 1, 2024.