Published December 16, 2025 | Version 1.0

HiTZ-Aholab speech synthesis dataset in Basque

HiTZ Center - Aholab, University of the Basque Country UPV/EHU

Description

General description

This resource is a high-quality Basque speech corpus compiled by HiTZ Zentroa / AhoLab.
It consists of studio-quality audio recordings in WAV format and their corresponding orthographic text transcriptions.

The corpus contains read speech produced by professional native Basque speakers, recorded in a professional recording studio under controlled acoustic conditions. The material was originally designed for the development of text-to-speech (TTS) systems, with careful attention to audio quality, pronunciation clarity, and phonetic coverage.

The recordings were produced by two speakers, Maider and Antton, and cover a range of sentence types and orthographic patterns, including declarative, interrogative, and exclamative sentences, as well as specific categories such as Spanish proper names, numerical expressions, and Basque-specific spelling phenomena (e.g., “tt”).

Corpus composition

The following table summarizes the distribution of utterances by category and speaker:

Category                  Maider   Antton
Spanish names                750      750
Interrogative sentences     2100     2103
Exclamative sentences       1476     1476
Declarative sentences       9920     9920
“tt” spelling examples       246      246
Numbers                      250      250

In total, the corpus comprises:

  • Maider: 13,500 utterances, approximately 17 h 33 min

  • Antton: 13,500 utterances, approximately 16 h 45 min
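As a rough sanity check on these totals, the sketch below derives the mean utterance length and the uncompressed audio size per speaker. It uses only the utterance counts and approximate durations stated above, together with the sample rate and bit depth listed under Technical details; the function name `summarize` and the dictionary keys are illustrative, and the results are only as precise as the rounded durations.

```python
# Per-speaker totals as stated in this description (durations are approximate).
SAMPLE_RATE_HZ = 48_000   # from the technical details of the corpus
BYTES_PER_SAMPLE = 3      # 24-bit PCM, one channel


def summarize(utterances, hours, minutes):
    """Derive rough per-speaker statistics from the stated totals."""
    total_seconds = hours * 3600 + minutes * 60
    return {
        "total_seconds": total_seconds,
        "mean_utterance_seconds": total_seconds / utterances,
        # Raw sample data only; WAV headers and archive compression are ignored.
        "raw_pcm_gigabytes": total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE / 1e9,
    }


maider = summarize(13_500, 17, 33)
antton = summarize(13_500, 16, 45)

for name, stats in (("Maider", maider), ("Antton", antton)):
    print(
        f"{name}: {stats['mean_utterance_seconds']:.2f} s per utterance, "
        f"about {stats['raw_pcm_gigabytes']:.1f} GB of raw PCM"
    )
```

Under these figures, both speakers average between four and five seconds per utterance, which is a useful number to have when choosing batch sizes or maximum sequence lengths for TTS training.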

Technical details

Property         Value
Language         Basque (Euskara, eu)
Speakers         Maider, Antton
Speaking style   Read speech
Recording        Professional studio
Sample rate      48,000 Hz
Channels         1 (mono)
Encoding         PCM signed 24-bit, WAV
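The properties above can be checked on a downloaded file with Python's standard `wave` module. The following is a minimal sketch, not an official validation tool: the function name `inspect_wav` and the `EXPECTED` dictionary are illustrative, and it assumes the files use a plain PCM WAV header (Python versions before 3.12 cannot read the WAVE_FORMAT_EXTENSIBLE variant with this module).

```python
import wave

# Expected values taken from the technical details table (24 bits = 3 bytes).
EXPECTED = {"sample_rate_hz": 48000, "channels": 1, "sample_width_bytes": 3}


def inspect_wav(path):
    """Return (mismatches, duration_seconds) for one WAV file.

    `mismatches` maps each property that differs from EXPECTED to a
    (expected, actual) tuple; it is empty when the file matches.
    """
    with wave.open(str(path), "rb") as wav:
        actual = {
            "sample_rate_hz": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
        }
        duration_seconds = wav.getnframes() / wav.getframerate()
    mismatches = {
        key: (EXPECTED[key], actual[key])
        for key in EXPECTED
        if actual[key] != EXPECTED[key]
    }
    return mismatches, duration_seconds
```

Calling `inspect_wav("NEU_00001.wav")` on an extracted recording should return an empty dictionary together with the utterance length in seconds if the file follows the stated format.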

Intended use

This corpus was primarily designed for text-to-speech (TTS) system development, particularly for high-quality or neural TTS models that benefit from:

  • Clean, studio-recorded audio

  • Consistent speaking style

  • Accurate orthographic transcriptions

  • Coverage of specific phonetic and orthographic phenomena in Basque

Data organization

The corpus is distributed as one compressed TAR archive per speaker, each containing the corresponding audio recordings in WAV format.

Within each archive:

  • Audio files are named using a unique utterance identifier, e.g.
    NEU_00001.wav, NEU_00002.wav, …

  • All recordings correspond to read utterances produced by a single speaker.
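Because each archive is several gigabytes, it can be convenient to list its utterance identifiers before extracting anything. The sketch below does this with Python's standard `tarfile` module. The function name `list_utterance_ids` is illustrative, the compression scheme is detected automatically rather than assumed, and any internal folder layout is ignored since it is not described here.

```python
import tarfile
from pathlib import PurePosixPath


def list_utterance_ids(archive_path):
    """Return the sorted utterance IDs (WAV file stems) found in a TAR archive."""
    ids = []
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz or none).
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            name = PurePosixPath(member.name)
            if member.isfile() and name.suffix.lower() == ".wav":
                ids.append(name.stem)
    return sorted(ids)
```

The length of the returned list can then be compared with the utterance count stated for the corresponding speaker.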

In addition, a plain-text transcription file is provided per speaker. Each line in the transcription file associates an utterance identifier with its orthographic transcription using the following format:

 
    NEU_00001 text of the sentence

The utterance identifier matches the WAV filename (without extension), enabling straightforward pairing of audio files and transcriptions.
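The pairing described above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the identifier and the text are assumed to be separated by whitespace (a space or a tab), the transcription file is assumed to be UTF-8 encoded, and the function names `load_transcriptions` and `pair_audio_with_text` are illustrative rather than part of the dataset.

```python
from pathlib import Path


def load_transcriptions(path, encoding="utf-8"):
    """Read 'ID text' lines into a dict mapping utterance ID to text."""
    transcripts = {}
    with open(path, encoding=encoding) as handle:
        for line in handle:
            # Split once on the first run of whitespace: ID, then the sentence.
            parts = line.strip().split(maxsplit=1)
            if not parts:
                continue  # skip blank lines
            utterance_id = parts[0]
            transcripts[utterance_id] = parts[1] if len(parts) > 1 else ""
    return transcripts


def pair_audio_with_text(wav_dir, transcripts):
    """Return (wav_path, text) pairs plus the IDs of WAV files with no text."""
    pairs, missing = [], []
    for wav_path in sorted(Path(wav_dir).rglob("*.wav")):
        if wav_path.stem in transcripts:
            pairs.append((wav_path, transcripts[wav_path.stem]))
        else:
            missing.append(wav_path.stem)
    return pairs, missing
```

Returning the unmatched identifiers separately makes it easy to spot an incomplete extraction or a mismatched transcription file before training starts.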

Licensing

Creative Commons Attribution 4.0 International (CC BY 4.0)

Ethical considerations

All speakers provided informed consent for the recording and distribution of their voices.

Funding

The development of this resource was funded by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia (funded by the EU, NextGenerationEU) within the framework of the ILENIA project (reference 2022/TL22/00215335), and by a grant from the Department of Culture and Language Policy of the Basque Government (IKER-GAITU project).

Versioning

This is version 1.0 of the dataset.

Contact

aholab@aholab.ehu.eus

HiTZ Center - Aholab, University of the Basque Country UPV/EHU

https://aholab.ehu.eus/aholab/

https://www.hitz.eus/

Files

hitz_aholab_eu_antton_corpus.txt

Files (16.2 GB)

Checksum (MD5)                     Size
d6a900c3d603fac42fa88acd8321e5aa   8.0 GB
73845837b286ff06a6362edfed9dce9b   1.0 MB
386059ebfa51bc0f095f8730a8b588e0   8.2 GB
73845837b286ff06a6362edfed9dce9b   1.0 MB
37c7f9a0bc4fa236ee53faa9f53e51ab   4.3 kB
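The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The function names `md5_of_file` and `matches_checksum` are illustrative, and any path passed in is a placeholder for the local copy of a file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 equals the listed value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum(local_path, "d6a900c3d603fac42fa88acd8321e5aa")` should return `True` for an intact copy of the 8.0 GB file.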