There is a newer version of the record available.

Published January 1, 2023 | Version 1.0
Dataset Open

Lists of Karakalpak Stopwords

  • 1. University of Primorska, FAMNIT
  • 2. Urgench State University

Contributors

Contact person:

Data collector:

Data manager:

  • 1. Urgench State University
  • 2. University of Primorska, UPFAMNIT

Description

The dataset presents three lists of stopwords in the Karakalpak language. The lists were constructed by applying three automatic methods to the same corpus.

The corpus, named the "Karakalpak School Corpus", was constructed from 23 school textbooks. It can be reconstructed from the list of URLs of all files in the corpus, which is included in the dataset (list_of_urls_for_karakalpak_school_corpus.txt).
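As an illustration of how the corpus could be rebuilt from the URL list, here is a minimal Python sketch. It assumes the list file holds one URL per line in UTF-8, which the record does not state. The helper names (`read_url_list`, `local_name`, `download_all`) and the target directory are hypothetical and not part of the dataset.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def read_url_list(text: str) -> list[str]:
    """Return the distinct http(s) URLs in `text`, assuming one URL per line."""
    urls: list[str] = []
    seen: set[str] = set()
    for line in text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue
        # Skip anything that does not look like a web URL.
        if urlparse(url).scheme in ("http", "https"):
            urls.append(url)
            seen.add(url)
    return urls


def local_name(url: str, index: int) -> str:
    """Derive a local file name from the URL path, falling back to an index."""
    name = PurePosixPath(urlparse(url).path).name
    return name or f"file_{index:03d}"


def download_all(urls: list[str], target_dir: str) -> list[Path]:
    """Download every URL into `target_dir` (hypothetical location)."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        destination = out_dir / local_name(url, i)
        urlretrieve(url, str(destination))
        saved.append(destination)
    return saved
```

A possible use is `download_all(read_url_list(Path("list_of_urls_for_karakalpak_school_corpus.txt").read_text(encoding="utf-8")), "karakalpak_school_corpus")`. Whether the linked files are still available at those URLs is not guaranteed by the record.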

Description of the methods and the lists:

A set of grammar rules and the TF-IDF algorithm were used to automatically collect a list of 4014 single-word stopwords. File name: Karakalpak_stopwords_unigrams.txt.
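To make the TF-IDF idea concrete, the sketch below scores each word by its mean TF-IDF over a set of tokenized documents and treats the lowest-scoring words as stopword candidates. This is only one common way to use TF-IDF for this purpose. It is not the authors' procedure, it does not reproduce their grammar rules, and the function names and threshold are assumptions.

```python
import math
from collections import Counter


def mean_tfidf(documents: list[list[str]]) -> dict[str, float]:
    """Mean TF-IDF of each word over all documents (0 where the word is absent)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq: Counter = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    totals: Counter = Counter()
    for tokens in documents:
        if not tokens:
            continue
        counts = Counter(tokens)
        for word, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            totals[word] += tf * idf
    return {word: totals[word] / n_docs for word in doc_freq}


def stopword_candidates(documents: list[list[str]], threshold: float = 0.0) -> list[str]:
    """Words whose mean TF-IDF is at or below `threshold` (an assumed cut-off)."""
    scores = mean_tfidf(documents)
    return sorted(word for word, score in scores.items() if score <= threshold)


# Placeholder tokens stand in for real Karakalpak words.
docs = [["w1", "w2", "w1"], ["w1", "w3"], ["w1", "w4", "w4"]]
print(stopword_candidates(docs))
```

With a threshold of 0, only words that occur in every document are selected, because their inverse document frequency is log(1) = 0. A real corpus would need a tuned threshold.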

A bigram method was used to extract a list of 3740 stopword bigrams (pairs). File name: Karakalpak_stopwords_bigram.txt.

A set of two-word collocations that act as stopwords was also extracted. The list contains 20745 pairs. File name: Karakalpak_stopwords_bigrams_with_collocations.txt.
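The lists described above could be applied to tokenized text roughly as follows. The sketch assumes each file is UTF-8 text with one entry per line and that a bigram entry is two words separated by whitespace. The record does not document the file layout, so check the actual files first. The function names are hypothetical.

```python
def parse_stopword_list(text: str) -> set[str]:
    """Parse a list with one entry per line, collapsing internal whitespace."""
    entries = set()
    for line in text.splitlines():
        words = line.split()
        if words:
            entries.add(" ".join(words))
    return entries


def remove_stopwords(tokens: list[str], unigrams: set[str], bigrams: set[str]) -> list[str]:
    """Drop tokens found in the unigram list or belonging to a listed bigram."""
    drop = [False] * len(tokens)
    for i, token in enumerate(tokens):
        if token in unigrams:
            drop[i] = True
        # Check the pair formed with the following token, if there is one.
        if i + 1 < len(tokens) and f"{token} {tokens[i + 1]}" in bigrams:
            drop[i] = True
            drop[i + 1] = True
    return [token for token, dropped in zip(tokens, drop) if not dropped]


# Placeholder tokens stand in for real Karakalpak words.
unigrams = parse_stopword_list("w1\nw2\n\n")
bigrams = parse_stopword_list("w3  w4\n")
print(remove_stopwords(["w1", "x", "w3", "w4", "y", "w4", "w2"], unigrams, bigrams))
```

To work with the real files, the text could be read with `Path("Karakalpak_stopwords_unigrams.txt").read_text(encoding="utf-8")` after unpacking the archive. The encoding is again an assumption.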

Files

karakalpak_stopwords.zip (162.8 kB)
md5:137f1a39401670cdbcfbf5f8f7767c1b

Additional details

Funding

InnoRenew CoE – Renewable materials and healthy environments research and innovation centre of excellence 739574
European Commission