HiTZ-Aholab speech synthesis dataset in Basque
Authors/Creators
- HiTZ Center - Aholab, University of the Basque Country UPV/EHU
Description
General description
This resource is a high-quality Basque speech corpus compiled by HiTZ Zentroa / Aholab.
It consists of studio-quality audio recordings in WAV format and their corresponding orthographic text transcriptions.
The corpus contains read speech produced by professional native Basque speakers, recorded in a professional recording studio under controlled acoustic conditions. The material was originally designed for the development of text-to-speech (TTS) systems, with careful attention to audio quality, pronunciation clarity, and phonetic coverage.
The recordings were produced by two speakers, Maider and Antton, and cover a range of sentence types and orthographic patterns, including declarative, interrogative, and exclamative sentences, as well as specific categories such as Spanish proper names, numerical expressions, and Basque-specific spelling phenomena (e.g., “tt”).
Corpus composition
The following table summarizes the distribution of utterances by category and speaker:
| Category | Maider | Antton |
|---|---|---|
| Spanish names | 750 | 750 |
| Interrogative sentences | 2100 | 2103 |
| Exclamative sentences | 1476 | 1476 |
| Declarative sentences | 9920 | 9920 |
| “tt” spelling examples | 246 | 246 |
| Numbers | 250 | 250 |
In total, the corpus comprises:
- Maider: 13,500 utterances, approximately 17 h 33 min
- Antton: 13,500 utterances, approximately 16 h 45 min
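As a quick sanity check on these totals, the average utterance length per speaker can be derived from the stated counts and durations. The sketch below uses only the figures quoted above; because the durations are approximate, the results are approximate too. The function name `mean_utterance_seconds` is illustrative and not part of the corpus.

```python
from datetime import timedelta

# Totals as stated in the corpus description (durations are approximate).
TOTALS = {
    "Maider": (13_500, timedelta(hours=17, minutes=33)),
    "Antton": (13_500, timedelta(hours=16, minutes=45)),
}


def mean_utterance_seconds(n_utterances: int, total: timedelta) -> float:
    """Average utterance length in seconds."""
    return total.total_seconds() / n_utterances


for speaker, (count, duration) in TOTALS.items():
    print(f"{speaker}: {mean_utterance_seconds(count, duration):.2f} s per utterance")
```

Under these figures the average comes out to roughly 4.7 s per utterance for Maider and 4.5 s for Antton.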
Technical details
| Property | Value |
|---|---|
| Language | Basque (Euskara, eu) |
| Speakers | Maider, Antton |
| Speaking style | Read speech |
| Recording | Professional studio |
| Sample rate | 48,000 Hz |
| Channels | 1 (mono) |
| Encoding | PCM signed 24-bit, WAV |
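The properties in the table can be checked against a downloaded file with Python's standard `wave` module. The following is a minimal sketch, not an official validation tool: the helper name `check_wav_header` is hypothetical, and it assumes the files carry a plain PCM header (on Python versions before 3.12, `wave` may reject files that use the extensible WAV header).

```python
import wave

# Expected header values, taken from the technical details table.
EXPECTED = {"framerate": 48_000, "nchannels": 1, "sampwidth": 3}  # 3 bytes = 24-bit


def check_wav_header(path: str) -> dict:
    """Return mismatched fields as {field: (found, expected)}; empty if the header matches."""
    with wave.open(path, "rb") as wav:
        found = {
            "framerate": wav.getframerate(),
            "nchannels": wav.getnchannels(),
            "sampwidth": wav.getsampwidth(),
        }
    return {key: (found[key], want) for key, want in EXPECTED.items() if found[key] != want}
```

An empty result means the file matches the stated format. At these settings one second of audio occupies 48,000 × 3 × 1 = 144,000 bytes, which helps when estimating the disk space needed after extraction.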
Intended use
This corpus was primarily designed for text-to-speech (TTS) system development, particularly for high-quality or neural TTS models that benefit from:
- Clean, studio-recorded audio
- Consistent speaking style
- Accurate orthographic transcriptions
- Coverage of specific phonetic and orthographic phenomena in Basque
Data organization
The corpus is distributed as one compressed TAR archive per speaker, each containing the corresponding audio recordings in WAV format.
Within each archive:
- Audio files are named using a unique utterance identifier, e.g. NEU_00001.wav, NEU_00002.wav, …
- All recordings correspond to read utterances produced by a single speaker.
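To inspect an archive before extracting it, the utterance identifiers can be listed with Python's standard `tarfile` module. This is a sketch under assumptions: the archive filename is not stated in this description, so any path passed in (for example `maider.tar.gz`) is hypothetical, as is the helper name `list_wav_ids`. The internal directory layout is also unknown, so the code matches WAV files at any depth.

```python
import tarfile
from pathlib import PurePosixPath


def list_wav_ids(archive_path: str) -> list[str]:
    """Return the sorted utterance IDs (WAV basenames without extension) in an archive."""
    ids = []
    # "r:*" lets tarfile detect the compression (gzip, bzip2, xz) automatically.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            name = PurePosixPath(member.name)
            if member.isfile() and name.suffix.lower() == ".wav":
                ids.append(name.stem)
    return sorted(ids)
```

Comparing `len(list_wav_ids(...))` with the utterance count stated for the speaker is a simple way to confirm that a download is complete.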
In addition, a plain-text transcription file is provided per speaker. Each line in the transcription file associates an utterance identifier with its orthographic transcription using the following format:
NEU_00001 text of the sentence
The utterance identifier matches the WAV filename (without extension), enabling straightforward pairing of audio files and transcriptions.
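The pairing described above can be sketched in a few lines of Python. Several details are assumptions rather than documented facts: the transcription file is taken to be UTF-8 encoded, the identifier is taken to be separated from the text by whitespace (space or tab), and the function names and paths are hypothetical.

```python
from pathlib import Path


def load_transcriptions(path: str) -> dict[str, str]:
    """Map utterance ID -> transcription, splitting each line on its first whitespace run."""
    table = {}
    # Assumption: the transcription file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            parts = line.split(maxsplit=1)
            table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def pair_audio_with_text(wav_dir: str, transcript_path: str):
    """Return (pairs, missing): (WAV path, text) tuples, plus IDs lacking a transcription."""
    table = load_transcriptions(transcript_path)
    pairs, missing = [], []
    for wav in sorted(Path(wav_dir).rglob("*.wav")):
        if wav.stem in table:
            pairs.append((wav, table[wav.stem]))
        else:
            missing.append(wav.stem)
    return pairs, missing
```

Returning the unmatched identifiers separately, rather than silently dropping them, makes it easier to spot an incomplete extraction or a mismatched transcription file before training.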
Licensing
Creative Commons Attribution 4.0 International (CC BY 4.0)
Ethical considerations
All speakers provided informed consent for the recording and distribution of their voices.
Funding
The development of this resource has been funded by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215335, and by a grant from the Department of Culture and Language Policy of the Basque Government (IKER-GAITU project).
Versioning
This is version 1.0 of the dataset.
Contact
aholab@aholab.ehu.eus
HiTZ Center - Aholab, University of the Basque Country UPV/EHU
https://aholab.ehu.eus/aholab/
https://www.hitz.eus/