HiTZ-Aholab speech synthesis dataset in Basque

Navas, Eva; Hernaez Rioja, Inmaculada; Saratxaga, Ibon; Sanchez, Jon; García Romillo, Víctor; Flores Ríos, Mariana

doi:10.5281/zenodo.17952596

Published December 16, 2025 | Version 1.0

Resource type: Video/Audio | Access: Open

1. HiTZ Center - Aholab, University of the Basque Country UPV/EHU

General description

This resource is a high-quality Basque speech corpus compiled by HiTZ Zentroa / AhoLab.
It consists of studio-quality audio recordings in WAV format and their corresponding orthographic text transcriptions.

The corpus contains read speech produced by professional native Basque speakers, recorded in a professional recording studio under controlled acoustic conditions. The material was originally designed for the development of text-to-speech (TTS) systems, with careful attention to audio quality, pronunciation clarity, and phonetic coverage.

The recordings were produced by two speakers, Maider and Antton, and cover a range of sentence types and orthographic patterns, including declarative, interrogative, and exclamative sentences, as well as specific categories such as Spanish proper names, numerical expressions, and Basque-specific spelling phenomena (e.g., “tt”).

Corpus composition

The following table summarizes the distribution of utterances by category and speaker:

Category	Maider	Antton
Spanish names	750	750
Interrogative sentences	2100	2103
Exclamative sentences	1476	1476
Declarative sentences	9920	9920
“tt” spelling examples	246	246
Numbers	250	250

In total, the corpus comprises:

Maider: 13,500 utterances, approximately 17 h 33 min
Antton: 13,500 utterances, approximately 16 h 45 min
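
As a rough sanity check, the totals above imply an average utterance length of a little under five seconds per speaker. The Python sketch below reproduces that arithmetic. The helper name `mean_utterance_seconds` is illustrative, not part of the dataset, and the durations are the approximate figures quoted above.

```python
def mean_utterance_seconds(hours: int, minutes: int, utterances: int) -> float:
    """Average utterance length in seconds, given a total duration and a count."""
    total_seconds = hours * 3600 + minutes * 60
    return total_seconds / utterances


# Approximate totals as listed above (assumed exact for this estimate).
maider = mean_utterance_seconds(17, 33, 13_500)
antton = mean_utterance_seconds(16, 45, 13_500)

print(f"Maider: {maider:.2f} s, Antton: {antton:.2f} s")
# prints "Maider: 4.68 s, Antton: 4.47 s"
```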

Technical details

Property	Value
Language	Basque (Euskara, `eu`)
Speakers	Maider, Antton
Speaking style	Read speech
Recording	Professional studio
Sample rate	48,000 Hz
Channels	1 (mono)
Encoding	PCM signed 24-bit, WAV
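
One way to confirm that an extracted recording matches the properties above is to read its header with Python's standard `wave` module. This is a minimal sketch, not an official validation tool. The function names and the example filename are assumptions. The `wave` module reads only uncompressed PCM, and older Python versions may reject files that use an extensible WAV header.

```python
import wave

# Expected values taken from the table above; 24-bit PCM is 3 bytes per sample.
EXPECTED = {"sample_rate": 48_000, "channels": 1, "sample_width_bytes": 3}


def wav_properties(path: str) -> dict:
    """Read basic header fields and the duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_spec(path: str) -> bool:
    """Return True if the file has the sample rate, channels, and bit depth above."""
    props = wav_properties(path)
    return all(props[key] == value for key, value in EXPECTED.items())
```

For example, `matches_spec("NEU_00001.wav")` would be expected to return `True` for a file extracted from either archive, assuming the archives follow the table above.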

Intended use

This corpus was primarily designed for text-to-speech (TTS) system development, particularly for high-quality or neural TTS models that benefit from:

Clean, studio-recorded audio
Consistent speaking style
Accurate orthographic transcriptions
Coverage of specific phonetic and orthographic phenomena in Basque

Data organization

The corpus is distributed as one compressed TAR archive per speaker, each containing the corresponding audio recordings in WAV format.

Within each archive:

Audio files are named using a unique utterance identifier, e.g. NEU_00001.wav, NEU_00002.wav, …
All recordings correspond to read utterances produced by a single speaker.
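
The contents of a speaker archive can be inspected without extracting it, using Python's standard `tarfile` module. The sketch below is hedged: the function name is illustrative, and the internal directory layout of the archives is not documented here, so the code matches on the `.wav` suffix only. Listing a multi-gigabyte gzip archive requires reading it end to end, so this may take some time.

```python
import tarfile


def list_wav_members(archive_path: str) -> list[str]:
    """Return the names of all .wav files inside a gzip-compressed TAR archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".wav")
        ]
```

A call such as `list_wav_members("hitz_aholab_eu_maider.tar.gz")` would then return the audio member names for that speaker.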

In addition, a plain-text transcription file is provided per speaker. Each line in the transcription file associates an utterance identifier with its orthographic transcription using the following format:

NEU_00001 text of the sentence

The utterance identifier matches the WAV filename (without extension), enabling straightforward pairing of audio files and transcriptions.
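
The pairing described above can be sketched in Python as follows. Several details are assumptions rather than documented facts: that the transcription file is UTF-8 encoded, that the identifier and the text are separated by whitespace, and that the WAV files may sit in subdirectories after extraction (hence the recursive search). The function names are illustrative.

```python
from pathlib import Path


def load_transcriptions(corpus_txt: Path) -> dict[str, str]:
    """Map utterance identifiers to text, assuming 'ID<whitespace>text' per line."""
    transcripts: dict[str, str] = {}
    with corpus_txt.open(encoding="utf-8") as handle:  # encoding is an assumption
        for line in handle:
            parts = line.strip().split(maxsplit=1)
            if not parts:
                continue  # skip blank lines
            utt_id = parts[0]
            transcripts[utt_id] = parts[1] if len(parts) > 1 else ""
    return transcripts


def pair_audio_with_text(wav_dir: Path, corpus_txt: Path) -> list[tuple[Path, str]]:
    """Pair each WAV file with its transcription via the filename stem."""
    transcripts = load_transcriptions(corpus_txt)
    pairs = []
    for wav_path in sorted(wav_dir.rglob("*.wav")):
        text = transcripts.get(wav_path.stem)
        if text is not None:  # WAV files without a transcription are skipped
            pairs.append((wav_path, text))
    return pairs
```

Matching on the filename stem keeps the pairing independent of the directory structure, which is useful when the extracted layout is not known in advance.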

Licensing

Creative Commons Attribution 4.0 International (CC BY 4.0)

Ethical considerations

All speakers provided informed consent for the recording and distribution of their voices.

Funding

The development of this resource was funded by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia (funded by the EU, NextGenerationEU) within the framework of the ILENIA project (reference 2022/TL22/00215335), and by a grant from the Department of Culture and Language Policy of the Basque Government (IKER-GAITU project).

Versioning

This is version 1.0 of the dataset.

Contact

aholab@aholab.ehu.eus

HiTZ Center - Aholab, University of the Basque Country UPV/EHU

https://aholab.ehu.eus/aholab/

https://www.hitz.eus/

Files

Files (16.2 GB)

Name	MD5	Size
hitz_aholab_eu_antton.tar.gz	d6a900c3d603fac42fa88acd8321e5aa	8.0 GB
hitz_aholab_eu_antton_corpus.txt	73845837b286ff06a6362edfed9dce9b	1.0 MB
hitz_aholab_eu_maider.tar.gz	386059ebfa51bc0f095f8730a8b588e0	8.2 GB
hitz_aholab_eu_maider_corpus.txt	73845837b286ff06a6362edfed9dce9b	1.0 MB
README.md	37c7f9a0bc4fa236ee53faa9f53e51ab	4.3 kB
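
The MD5 checksums listed above can be used to check that a download completed without corruption. The following Python sketch uses the standard `hashlib` module and reads the file in chunks so that the multi-gigabyte archives do not need to fit in memory. The function names and the chunk size are illustrative choices.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For instance, `verify_download("hitz_aholab_eu_antton.tar.gz", "d6a900c3d603fac42fa88acd8321e5aa")` should return `True` for an intact copy of that archive.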
