Spoken language technologies for under-resourced languages: a case study on Pomak
Creators
- Athena-Research and Innovation Center in Information, Communication and Knowledge Technologies
Description
Automatic speech recognition (ASR) and text-to-speech (TTS) systems have gained popularity in the past decade. However, training robust models typically requires several hundred hours of recorded speech, and most languages lack resources of that size. Within the PHILOTIS project, a pipeline for the Pomak language has been developed. Pomak is an endangered Slavic language spoken in the Balkans, including Greece. Eight hours of Pomak read speech were used to train an ASR system and a TTS system, and to extract word-level speech-text alignments. For the ASR system, a wav2vec2 model pretrained on more than 400k hours of speech in 23 languages and subsequently fine-tuned on Slavic languages was employed. The Voice Activity Detection (VAD) implementation of PyAnnote was used to segment the audio files, and its output was manually corrected. The model was then fine-tuned on those segments, resulting in a WER of 1.57. The speech-text alignments were obtained via two alternative paths: i) the fine-tuned Pomak model, and ii) SailAlign, a toolkit for speech-text alignment of long audio files. For the second path, the toolkit was provided with an approximate mapping from Pomak graphemes to English phonemes, which made it possible to use a pre-trained English acoustic model without training a Pomak ASR system. This path yielded alignments as good as those from the fine-tuned model, and they can be used to segment new recordings. Finally, a two-stage TTS system was trained on the segmented audio samples using the Coqui TTS framework. Specifically, a GlowTTS model was trained to generate mel-spectrograms from text, and a HiFiGAN vocoder was trained to transform the mel-spectrograms into audio waveforms. Preliminary results indicate promising speech quality based on internal evaluation.
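The ASR result above is reported as a word error rate (WER). As a reminder of what that metric measures, the following is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. It is not the evaluation code used in the project; the function name `word_error_rate` and the example sentences are illustrative assumptions, and no text normalization is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name), not the project's evaluation code.
    Words are obtained by whitespace splitting, with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder English sentences, chosen only to exercise the function.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

The function returns a fraction; multiplying by 100 gives the percentage form in which WER is commonly reported. Because insertions count as errors, the value can exceed 1 when the hypothesis is much longer than the reference.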
Files
poster.pdf (1.2 MB)
md5:6b88c1e05ce3b4a99c93082787ff07d3