Published September 19, 2022 | Version v1
Dataset Open

CoRoLa Frequency Lists

  • 1. Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy

Description

The Reference Corpus for Contemporary Romanian Language (CoRoLa) was constructed as a priority project of the Romanian Academy. It contains both written texts and oral recordings. It aims to cover the major functional styles of the language (legal, scientific, journalistic, imaginative, memoirs, administrative) across four domains (arts and culture, nature, society, science) and 71 sub-domains, while respecting intellectual property rights (IPR). With over 1 billion word tokens (written and spoken), CoRoLa is one of the largest fully IPR-cleared reference corpora in the world. https://corola.racai.ro
 

This dataset contains multiple frequency lists extracted from CoRoLa: 12 word-based frequency lists and 12 lemma-based frequency lists. They were built only from tokens consisting of letters (tokens containing numbers or special symbols were excluded). Lemmatization was performed automatically at corpus level using the TTL tool. The following files are available:

  • corola_word_freq_all: frequency list for all tokens, as they appear in the corpus
  • corola_word_freq_all_nodiacritics: frequency list for all tokens, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_word_freq_all_lowercase: frequency list for all tokens, lowercased
  • corola_word_freq_all_lowercase_nodiacritics: frequency list for all tokens, lowercased and with diacritics removed
  • corola_word_freq_gte5: frequency list for tokens appearing at least 5 times in the corpus
  • corola_word_freq_gte5_nodiacritics: frequency list for tokens appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_word_freq_gte5_lowercase: frequency list for tokens appearing at least 5 times in the corpus, lowercased
  • corola_word_freq_gte5_lowercase_nodiacritics: frequency list for tokens appearing at least 5 times in the corpus, lowercased and with diacritics removed
  • corola_word_freq_gte10: frequency list for tokens appearing at least 10 times in the corpus
  • corola_word_freq_gte10_nodiacritics: frequency list for tokens appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_word_freq_gte10_lowercase: frequency list for tokens appearing at least 10 times in the corpus, lowercased
  • corola_word_freq_gte10_lowercase_nodiacritics: frequency list for tokens appearing at least 10 times in the corpus, lowercased and with diacritics removed
  • corola_lemma_freq_all: frequency list for all lemmas, as they appear in the corpus
  • corola_lemma_freq_all_nodiacritics: frequency list for all lemmas, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_lemma_freq_all_lowercase: frequency list for all lemmas, lowercased
  • corola_lemma_freq_all_lowercase_nodiacritics: frequency list for all lemmas, lowercased and with diacritics removed
  • corola_lemma_freq_gte5: frequency list for lemmas appearing at least 5 times in the corpus
  • corola_lemma_freq_gte5_nodiacritics: frequency list for lemmas appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_lemma_freq_gte5_lowercase: frequency list for lemmas appearing at least 5 times in the corpus, lowercased
  • corola_lemma_freq_gte5_lowercase_nodiacritics: frequency list for lemmas appearing at least 5 times in the corpus, lowercased and with diacritics removed
  • corola_lemma_freq_gte10: frequency list for lemmas appearing at least 10 times in the corpus
  • corola_lemma_freq_gte10_nodiacritics: frequency list for lemmas appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_lemma_freq_gte10_lowercase: frequency list for lemmas appearing at least 10 times in the corpus, lowercased
  • corola_lemma_freq_gte10_lowercase_nodiacritics: frequency list for lemmas appearing at least 10 times in the corpus, lowercased and with diacritics removed
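The variants listed above differ only in case folding, diacritics removal, and a minimum-frequency threshold. The following Python sketch illustrates how such variants could be derived from a token stream. It is not the procedure used to build the released files: the function names (`strip_diacritics`, `frequency_list`), the use of `str.isalpha()` as the "letters only" filter, and the choice to apply the threshold after normalization are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Replace accented letters with their base letters (e.g. 'ș' -> 's')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def frequency_list(tokens, lowercase=False, nodiacritics=False, min_count=1):
    """Count letter-only tokens after optional normalization.

    Assumption: the threshold is applied to the normalized counts.
    """
    counts = Counter()
    for token in tokens:
        # Compose characters first so accented letters are single code points.
        token = unicodedata.normalize("NFC", token)
        if not token.isalpha():  # skip tokens with digits or symbols
            continue
        if lowercase:
            token = token.lower()
        if nodiacritics:
            token = strip_diacritics(token)
        counts[token] += 1
    return {t: c for t, c in counts.items() if c >= min_count}


# Hypothetical sample tokens, not taken from the corpus.
sample = ["Țară", "țară", "tara", "și", "Și", "2022", "e-mail"]
plain = frequency_list(sample)
folded = frequency_list(sample, lowercase=True, nodiacritics=True)
```

In this toy sample, normalization merges several distinct surface forms into one entry, which is why the normalized lists in the dataset are shorter than the unnormalized ones.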

 

Number of entries in each of the released files

File                                            Number of entries
corola_lemma_freq_all_lowercase_nodiacritics    1,375,725
corola_lemma_freq_all_lowercase                 1,457,518
corola_lemma_freq_all_nodiacritics              1,562,523
corola_lemma_freq_all                           1,635,250
corola_lemma_freq_gte10_lowercase_nodiacritics    227,590
corola_lemma_freq_gte10_lowercase                 235,234
corola_lemma_freq_gte10_nodiacritics              242,325
corola_lemma_freq_gte10                           248,593
corola_lemma_freq_gte5_lowercase_nodiacritics     351,596
corola_lemma_freq_gte5_lowercase                  365,463
corola_lemma_freq_gte5_nodiacritics               380,751
corola_lemma_freq_gte5                            392,053
corola_word_freq_all_lowercase_nodiacritics     1,685,410
corola_word_freq_all_lowercase                  1,813,746
corola_word_freq_all_nodiacritics               2,112,107
corola_word_freq_all                            2,260,992
corola_word_freq_gte10_lowercase_nodiacritics     358,577
corola_word_freq_gte10_lowercase                  381,715
corola_word_freq_gte10_nodiacritics               447,538
corola_word_freq_gte10                            473,087
corola_word_freq_gte5_lowercase_nodiacritics      517,630
corola_word_freq_gte5_lowercase                   553,031
corola_word_freq_gte5_nodiacritics                650,971
corola_word_freq_gte5                             690,676
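To make the entry counts easier to compare, the sketch below restates them as a Python dictionary and computes how much each list shrinks when tokens are both lowercased and stripped of diacritics. The numbers are copied from the table above. The dictionary layout and the `reduction` helper are illustrative assumptions, not part of the dataset.

```python
# Entry counts copied from the table, grouped by (unit, threshold).
# "raw" denotes the file without a normalization suffix.
entries = {
    ("lemma", "all"): {"raw": 1_635_250, "nodiacritics": 1_562_523,
                       "lowercase": 1_457_518, "lowercase_nodiacritics": 1_375_725},
    ("lemma", "gte5"): {"raw": 392_053, "nodiacritics": 380_751,
                        "lowercase": 365_463, "lowercase_nodiacritics": 351_596},
    ("lemma", "gte10"): {"raw": 248_593, "nodiacritics": 242_325,
                         "lowercase": 235_234, "lowercase_nodiacritics": 227_590},
    ("word", "all"): {"raw": 2_260_992, "nodiacritics": 2_112_107,
                      "lowercase": 1_813_746, "lowercase_nodiacritics": 1_685_410},
    ("word", "gte5"): {"raw": 690_676, "nodiacritics": 650_971,
                       "lowercase": 553_031, "lowercase_nodiacritics": 517_630},
    ("word", "gte10"): {"raw": 473_087, "nodiacritics": 447_538,
                        "lowercase": 381_715, "lowercase_nodiacritics": 358_577},
}


def reduction(unit: str, threshold: str) -> float:
    """Fraction of entries removed by lowercasing plus diacritics removal."""
    group = entries[(unit, threshold)]
    return 1 - group["lowercase_nodiacritics"] / group["raw"]


for (unit, threshold), group in sorted(entries.items()):
    print(f"{unit:5s} {threshold:5s} {reduction(unit, threshold):.1%}")
```

In every group the counts decrease in the order raw, nodiacritics, lowercase, lowercase_nodiacritics, so case folding merges more entries than diacritics removal does.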

 

Files

corola_frequencies.zip (114.1 MB)
md5:0034d61d1825b38386dfa0bc2606a314
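
The MD5 checksum listed for corola_frequencies.zip can be used to check that a download completed without corruption. The following Python sketch shows one way to do this with the standard `hashlib` module. The helper names and the default local path are assumptions; adjust the path to wherever the archive was saved.

```python
import hashlib

# Checksum as published on this page for corola_frequencies.zip.
EXPECTED_MD5 = "0034d61d1825b38386dfa0bc2606a314"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path="corola_frequencies.zip"):
    """Return True if the file at `path` matches the published checksum.

    The default path is a hypothetical download location.
    """
    return md5_of_file(path) == EXPECTED_MD5
```

Reading in chunks keeps memory use low, which matters little for a 114.1 MB archive but costs nothing.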