Published March 23, 2021 | Version v1
Dataset Open

Enriched Kotus word list

  • 1. UEF

Description

The so-called Kotus word list consists of the words in the 1990s Perussanakirja (Basic Dictionary of Finnish). It is available in its original form here: https://kaino.kotus.fi/sanat/nykysuomi/

The version of the word list published here, comprising 94 385 lexemes, is a modification that combines information from two sources:

  • UD1 (Universal Dependency Parser) of the Turku NLP group: analysis runs were performed in The Language Bank of Finland
  • Semantic tags assigned by the FiST semantic tagger, based on the UCREL Finnish semantic tag system: https://github.com/UCREL/Multilingual-USAS/tree/master/Finnish

 If the word has been tagged with the semantic tags by FiST, the output looks like this:

 aakkonen Noun Q3

 If the word was not analyzed by FiST, it is given its UD1 analysis and the tag Z99:

 aallokas NOUN§ Case=Nom|Number=Sing Z99

 UD1 was able to split 39 524 of the compounds not analyzed by FiST into constituents. Constituent boundaries are marked with #:

 aallonpituus aallon#pituus NOUN§ Case=Nom|Number=Sing Z99

The constituent boundaries are often correct, but some boundaries are missing and some analyses are odd.
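The three line shapes described above can be read with a small parser. The following is only a sketch under stated assumptions: fields are taken to be separated by whitespace, the trailing § is taken to mark the UD1 part-of-speech field, and lemmas are taken to contain no spaces. The names Entry and parse_line are hypothetical and not part of the dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    lemma: str
    pos: str
    tags: List[str]                      # semantic tags, or ["Z99"] if FiST had no analysis
    feats: Optional[str] = None          # UD1 morphological features, if present
    constituents: List[str] = field(default_factory=list)  # compound parts, if segmented


def parse_line(line: str) -> Entry:
    """Parse one entry line (assumed whitespace-separated) into an Entry."""
    tokens = line.split()
    if len(tokens) < 3:
        raise ValueError(f"too few fields: {line!r}")

    # Assumption: a UD1 analysis is recognizable by a POS field ending in "§".
    pos_idx = next(
        (i for i, tok in enumerate(tokens) if i > 0 and tok.endswith("§")), None
    )
    if pos_idx is None:
        # FiST-tagged entry: lemma, POS, one or more semantic tags.
        return Entry(lemma=tokens[0], pos=tokens[1], tags=tokens[2:])
    if pos_idx == len(tokens) - 1:
        raise ValueError(f"no tag after POS field: {line!r}")

    # If the POS field is third, the second field is the '#'-segmented form.
    constituents = tokens[1].split("#") if pos_idx == 2 else []
    feats = " ".join(tokens[pos_idx + 1:-1]) or None
    return Entry(
        lemma=tokens[0],
        pos=tokens[pos_idx].rstrip("§"),
        tags=[tokens[-1]],
        feats=feats,
        constituents=constituents,
    )


# The three example lines from the description.
samples = [
    "aakkonen Noun Q3",
    "aallokas NOUN§ Case=Nom|Number=Sing Z99",
    "aallonpituus aallon#pituus NOUN§ Case=Nom|Number=Sing Z99",
]
for sample in samples:
    print(parse_line(sample))
```

Because the segmentation is not always reliable, the constituents list is best treated as a suggestion rather than a gold-standard analysis.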

The lexical coverage of FiST on this data is low, 28.68%, because the word list contains about 52 269 compounds, most of which are not included in the FiST lexicon. Many of them could, however, be analyzed on the basis of their constituents.
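A coverage figure of this kind can be recomputed from the file itself. The sketch below assumes that coverage means the share of entries whose last field is something other than Z99, and that fields are separated by whitespace. The function name fist_coverage is hypothetical. Under that reading, 28.68% of 94 385 lexemes corresponds to roughly 27 070 tagged entries.

```python
from typing import Iterable

UNMATCHED_TAG = "Z99"  # tag given to words FiST did not analyze


def fist_coverage(lines: Iterable[str]) -> float:
    """Return the percentage of non-empty lines not ending in Z99."""
    total = 0
    tagged = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        total += 1
        if tokens[-1] != UNMATCHED_TAG:
            tagged += 1
    if total == 0:
        raise ValueError("no entries to count")
    return 100.0 * tagged / total


# Illustration on the three example lines from the description only;
# the real figure requires running this over the full word list.
samples = [
    "aakkonen Noun Q3",
    "aallokas NOUN§ Case=Nom|Number=Sing Z99",
    "aallonpituus aallon#pituus NOUN§ Case=Nom|Number=Sing Z99",
]
print(f"{fist_coverage(samples):.2f}%")
```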

Notes

semantic tags

Files

Files (4.4 MB)

md5:1db6bce1ea4a6a61ef927b96c5660168 (4.4 MB)