Speech Recognition for Endangered and Extinct Samoyedic languages

This dataset supplements the research paper Speech Recognition for Endangered and Extinct Samoyedic languages by Niko Partanen, Mika Hämäläinen and Tiina Klooster. In this study, a series of Persephone models was trained for the Nganasan and Kamas languages. The preprocessing scripts, training data and resulting ASR models are all published on Zenodo under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. The license follows that of the original corpora.

In this study we used the INEL Kamas Corpus 1.0 and the Nganasan Spoken Language Corpus 0.2. Both corpora are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License.

Please cite our paper as follows:

Partanen, Niko; Hämäläinen, Mika; Klooster, Tiina. 2020. Speech Recognition for Endangered and Extinct Samoyedic languages. Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation.

Please cite the corpora as follows:

Gusev, Valentin; Klooster, Tiina; Wagner-Nagy, Beáta. 2019. "INEL Kamas Corpus." Version 1.0. Publication date 2019-12-15. http://hdl.handle.net/11022/0000-0007-DA6E-9. Archived in Hamburger Zentrum für Sprachkorpora. In: Wagner-Nagy, Beáta; Arkhipov, Alexandre; Ferger, Anne; Jettka, Daniel; Lehmberg, Timm (eds.). The INEL corpora of indigenous Northern Eurasian languages.

Brykina, Maria; Gusev, Valentin; Szeverényi, Sándor; Wagner-Nagy, Beáta. 2018. "Nganasan Spoken Language Corpus (NSLC)." Version 0.2. Publication date 2018-06-12. http://hdl.handle.net/11022/0000-0007-C6F2-8. Archived in Hamburger Zentrum für Sprachkorpora.

We recommend using the models either through Christopher Cox's Persephone-ELAN extension or through Persephone itself. The experiment numbers in this repository are matched with those in our paper by giving the paper's experiment number in parentheses. When loading a model, the data directory and the model number have to correspond.

Experiment description

  • Experiment_01_data: Nganasan data for speaker 1 (1)
  • Experiment_02_data: Nganasan data for speaker 2 (2)
  • Experiment_03_data: Nganasan data for speaker 3 (3)
  • Experiment_07_10_data: Kamas data with different transcription representations (4-9)

Model description

For the model accuracies and exact descriptions, please refer to the publication.

  • Model 01: Nganasan model, original transcript, no spaces (1)
  • Model 02: Nganasan model, original transcript, no spaces (2)
  • Model 03: Nganasan model, original transcript, no spaces (3)
  • Model 07: Kamas model, original transcript, no spaces (4)
  • Model 08: Kamas model, original transcript, with spaces (5)
  • Model 09: Kamas model, IPA transcript, no spaces (6)
  • Model 10: Kamas model, IPA transcript, predicted pauses (7)
  • Model 09_0448: Gradual data augmentation test 1 (6-1)
  • Model 09_0896: Gradual data augmentation test 2 (6-2)
  • Model 09_1856: Gradual data augmentation test 3 (6-3)
  • Model 09_2816: Gradual data augmentation test 4 (6-4)
  • Model 09_3776: Gradual data augmentation test 5 (6-5)
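
The correspondence between model numbers, data directories and the paper's experiment numbers can be sketched as a small lookup table. This is only an illustrative sketch built from the lists in this description: the names MODELS, data_dir_for and paper_experiment_for are hypothetical helpers, not part of Persephone, and pairing the gradual data augmentation models (09_0448 to 09_3776) with Experiment_07_10_data is an assumption based on their being derived from Model 09.

```python
# Lookup table built from the experiment and model lists in this description.
# Each value is (data directory, experiment number used in the paper).
MODELS = {
    "01": ("Experiment_01_data", "1"),
    "02": ("Experiment_02_data", "2"),
    "03": ("Experiment_03_data", "3"),
    "07": ("Experiment_07_10_data", "4"),
    "08": ("Experiment_07_10_data", "5"),
    "09": ("Experiment_07_10_data", "6"),
    "10": ("Experiment_07_10_data", "7"),
    # Assumption: the augmentation models derived from Model 09
    # use the same Kamas data directory as Models 07-10.
    "09_0448": ("Experiment_07_10_data", "6-1"),
    "09_0896": ("Experiment_07_10_data", "6-2"),
    "09_1856": ("Experiment_07_10_data", "6-3"),
    "09_2816": ("Experiment_07_10_data", "6-4"),
    "09_3776": ("Experiment_07_10_data", "6-5"),
}


def data_dir_for(model_id: str) -> str:
    """Return the data directory that has to accompany the given model."""
    if model_id not in MODELS:
        raise ValueError(f"Unknown model number: {model_id!r}")
    return MODELS[model_id][0]


def paper_experiment_for(model_id: str) -> str:
    """Return the experiment number the paper uses for the given model."""
    if model_id not in MODELS:
        raise ValueError(f"Unknown model number: {model_id!r}")
    return MODELS[model_id][1]


if __name__ == "__main__":
    for model_id in MODELS:
        print(model_id, data_dir_for(model_id), paper_experiment_for(model_id))
```

A helper like this can be called before loading a model to check that the data directory passed to Persephone is the one that belongs to the chosen model number.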