CoRoLa Frequency Lists
Creators
- Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy
Description
The Reference Corpus for Contemporary Romanian Language (CoRoLa) was constructed as a priority project of the Romanian Academy. It contains both written texts and oral recordings. It aims to cover the major functional language styles (legal, scientific, journalistic, imaginative, memoirs, administrative) across four domains (arts and culture, nature, society, science) and 71 sub-domains, while taking intellectual property rights (IPR) into account. With over 1 billion word tokens (written and spoken), CoRoLa is one of the largest fully IPR-cleared reference corpora in the world. https://corola.racai.ro
This dataset contains multiple frequency lists extracted from CoRoLa. There are 12 word-based frequency lists and 12 lemma-based frequency lists. These were constructed only from tokens containing letters (tokens with numbers or special symbols were excluded). Lemmatization was performed automatically at corpus level using the TTL tool. The following files are available:
- corola_word_freq_all: frequency list for all tokens, as they appear in the corpus
- corola_word_freq_all_nodiacritics: frequency list for all tokens, with diacritics removed (replaced with the corresponding ASCII letters)
- corola_word_freq_all_lowercase: frequency list for all tokens, lowercased
- corola_word_freq_all_lowercase_nodiacritics: frequency list for all tokens, lowercased and with diacritics removed
- corola_word_freq_gte5: frequency list for tokens appearing at least 5 times in the corpus
- corola_word_freq_gte5_nodiacritics: frequency list for tokens appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
- corola_word_freq_gte5_lowercase: frequency list for tokens appearing at least 5 times in the corpus, lowercased
- corola_word_freq_gte5_lowercase_nodiacritics: frequency list for tokens appearing at least 5 times in the corpus, lowercased and with diacritics removed
- corola_word_freq_gte10: frequency list for tokens appearing at least 10 times in the corpus
- corola_word_freq_gte10_nodiacritics: frequency list for tokens appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
- corola_word_freq_gte10_lowercase: frequency list for tokens appearing at least 10 times in the corpus, lowercased
- corola_word_freq_gte10_lowercase_nodiacritics: frequency list for tokens appearing at least 10 times in the corpus, lowercased and with diacritics removed
- corola_lemma_freq_all: frequency list for all lemmas, as they appear in the corpus
- corola_lemma_freq_all_nodiacritics: frequency list for all lemmas, with diacritics removed (replaced with the corresponding ASCII letters)
- corola_lemma_freq_all_lowercase: frequency list for all lemmas, lowercased
- corola_lemma_freq_all_lowercase_nodiacritics: frequency list for all lemmas, lowercased and with diacritics removed
- corola_lemma_freq_gte5: frequency list for lemmas appearing at least 5 times in the corpus
- corola_lemma_freq_gte5_nodiacritics: frequency list for lemmas appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
- corola_lemma_freq_gte5_lowercase: frequency list for lemmas appearing at least 5 times in the corpus, lowercased
- corola_lemma_freq_gte5_lowercase_nodiacritics: frequency list for lemmas appearing at least 5 times in the corpus, lowercased and with diacritics removed
- corola_lemma_freq_gte10: frequency list for lemmas appearing at least 10 times in the corpus
- corola_lemma_freq_gte10_nodiacritics: frequency list for lemmas appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
- corola_lemma_freq_gte10_lowercase: frequency list for lemmas appearing at least 10 times in the corpus, lowercased
- corola_lemma_freq_gte10_lowercase_nodiacritics: frequency list for lemmas appearing at least 10 times in the corpus, lowercased and with diacritics removed
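The variants listed above differ only in lowercasing, diacritic removal, and a minimum-frequency cutoff. The following is a minimal Python sketch of how such variants could be derived from a stream of tokens. It is not the authors' pipeline: the function names, the use of `str.isalpha` as the letters-only filter, and the choice to apply the cutoff after normalization are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text):
    """Map letters with diacritics to their base letters (e.g. 'ș' -> 's')."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def frequency_list(tokens, lowercase=False, nodiacritics=False, min_count=1):
    """Count tokens, optionally normalizing them and applying a cutoff.

    Assumption: a token "contains only letters" when str.isalpha() is true,
    and the cutoff is applied to the counts obtained after normalization.
    """
    counts = Counter()
    for token in tokens:
        if not token.isalpha():
            continue  # skip tokens with digits or special symbols
        if lowercase:
            token = token.lower()
        if nodiacritics:
            token = strip_diacritics(token)
        counts[token] += 1
    return {token: n for token, n in counts.items() if n >= min_count}


# Toy input, invented for this example (not taken from CoRoLa).
sample = ["Țară", "țară", "tara", "2024", "și", "Și", "și"]
print(frequency_list(sample, lowercase=True, nodiacritics=True))
# -> {'tara': 3, 'si': 3}
```

Normalization merges entries that differ only in case or diacritics, which is why the normalized lists contain fewer entries than the unnormalized ones.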
File | # Entries |
---|---|
corola_lemma_freq_all_lowercase_nodiacritics | 1,375,725 |
corola_lemma_freq_all_lowercase | 1,457,518 |
corola_lemma_freq_all_nodiacritics | 1,562,523 |
corola_lemma_freq_all | 1,635,250 |
corola_lemma_freq_gte10_lowercase_nodiacritics | 227,590 |
corola_lemma_freq_gte10_lowercase | 235,234 |
corola_lemma_freq_gte10_nodiacritics | 242,325 |
corola_lemma_freq_gte10 | 248,593 |
corola_lemma_freq_gte5_lowercase_nodiacritics | 351,596 |
corola_lemma_freq_gte5_lowercase | 365,463 |
corola_lemma_freq_gte5_nodiacritics | 380,751 |
corola_lemma_freq_gte5 | 392,053 |
corola_word_freq_all_lowercase_nodiacritics | 1,685,410 |
corola_word_freq_all_lowercase | 1,813,746 |
corola_word_freq_all_nodiacritics | 2,112,107 |
corola_word_freq_all | 2,260,992 |
corola_word_freq_gte10_lowercase_nodiacritics | 358,577 |
corola_word_freq_gte10_lowercase | 381,715 |
corola_word_freq_gte10_nodiacritics | 447,538 |
corola_word_freq_gte10 | 473,087 |
corola_word_freq_gte5_lowercase_nodiacritics | 517,630 |
corola_word_freq_gte5_lowercase | 553,031 |
corola_word_freq_gte5_nodiacritics | 650,971 |
corola_word_freq_gte5 | 690,676 |
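The 24 names in the table follow a regular scheme: a unit (word or lemma), a frequency threshold (all, gte5, gte10), and an optional normalization suffix. As a convenience, the names can be generated rather than typed by hand. The sketch below only builds the base names as they are listed here; the variable and function names are hypothetical, and no file extension is assumed because none is given in this record.

```python
from itertools import product

UNITS = ("word", "lemma")
THRESHOLDS = ("all", "gte5", "gte10")
# The empty suffix is the list "as it appears in the corpus".
VARIANTS = ("", "_nodiacritics", "_lowercase", "_lowercase_nodiacritics")


def corola_list_names():
    """Return the 24 base names of the CoRoLa frequency lists."""
    return [
        f"corola_{unit}_freq_{threshold}{variant}"
        for unit, threshold, variant in product(UNITS, THRESHOLDS, VARIANTS)
    ]


if __name__ == "__main__":
    for name in corola_list_names():
        print(name)
```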
Files
Name | Size | MD5 |
---|---|---|
corola_frequencies.zip | 114.1 MB | 0034d61d1825b38386dfa0bc2606a314 |
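After downloading the archive, its integrity can be checked against the MD5 value listed above. The following is a minimal sketch using Python's standard `hashlib` module. It assumes the checksum belongs to corola_frequencies.zip and that the archive sits in the current directory; the function names and the local path are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksum as listed in this record (assumed to refer to the zip archive).
EXPECTED_MD5 = "0034d61d1825b38386dfa0bc2606a314"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so a large archive is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("corola_frequencies.zip")  # hypothetical local path
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum mismatch")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be downloaded again.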