KARAKALPAK SPEECH CORPUS: THE FIRST BENCHMARK DATASET FOR AUTOMATIC SPEECH RECOGNITION

Niyetbay Uteuliev; Kabul Khudaybergenov; Jabbar Kudaybergenov; Tangirbergen Kudaybergenov

doi:10.5281/zenodo.19079670

Published March 18, 2026 | Version v1

Journal article Open

1. DSc, Head of Department, Nukus State Technical University, Nukus, Uzbekistan
2. PhD, Kimyo International University in Tashkent, Tashkent, Uzbekistan
3. Senior Lecturer, Nukus State Technical University, Nukus, Uzbekistan
4. Teaching Assistant, Nukus State Technical University, Nukus, Uzbekistan

While large-scale pre-trained models have significantly advanced multilingual Automatic Speech Recognition (ASR), many low-resource languages remain under-served due to the scarcity of high-quality annotated speech corpora. This paper introduces the Karakalpak Speech Corpus (KSC), the first publicly available benchmark dataset for Karakalpak, a Turkic language spoken by over two million people primarily in Karakalpakstan. The corpus encompasses 50 hours of predominantly read speech. The data was collected from 25 native speakers with a balanced gender distribution. To establish a performance benchmark, we fine-tuned the Wav2Vec 2.0 architecture, specifically evaluating the efficacy of transfer learning from multilingual pre-trained models.
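
The abstract does not state which metric the benchmark reports. ASR systems are commonly scored with word error rate (WER), so the following is a minimal sketch of that metric under the assumption that a WER-style score applies here. The function name `word_error_rate` and the placeholder tokens are illustrative and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization with no text normalization.
    The paper's own scoring procedure is not described in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real corpus data: one substitution plus one
# deletion against a four-word reference.
score = word_error_rate("w1 w2 w3 w4", "w1 x c3".replace("c3", "w3"))
```

In practice the reference and hypothesis transcripts would typically be normalized (casing, punctuation) before scoring; that step is omitted here because the abstract gives no detail on it.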
