Semantically tagged Finnish Wikipedia 2017
Description
Description of FI Wikipedia 2017 tagging
Kimmo Kettunen
University of Eastern Finland
The tagged data contains the texts of the Finnish Wikipedia of 2017. The texts were first tagged syntactically in the Language Bank of Finland using the UD2 tagger version available in the Mylly service (https://mylly.rahtiapp.fi/home).
Semantic tags were then added to the UD2 parse using the lexical semantic tagger FiST (Kettunen, 2019, https://aclanthology.org/W19-0306/).
The published version has been condensed to a format in which each analysed word contains the
1. original running word form,
2. lemma of the word form from the UD2 parse,
3. part of speech of the word from FiST,
4. semantic tag(s) for the word from FiST, and
5. syntactic function of the word from the UD2 parse.
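As an illustration, the five fields could be read from a token line roughly as follows. This is a minimal sketch, not part of the published tooling: the `Token` class and the `parse_token_line` function are hypothetical names, and the layout (first four fields joined by `#`, semantic tags separated by spaces, syntactic function last) is inferred from the example output in this description rather than from a format specification. Word forms that themselves contain `#` are not handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str               # 1. original running word form
    lemma: Optional[str]    # 2. lemma from the UD2 parse
    pos: Optional[str]      # 3. part of speech from FiST
    sem_tags: List[str]     # 4. semantic tag(s) from FiST
    deprel: Optional[str]   # 5. syntactic function from the UD2 parse


def parse_token_line(line: str) -> Token:
    """Parse one token line such as 'on#olla#Verb#A3+ A1.1.1 M6 Z5 cop'."""
    parts = line.strip().split("#", 3)
    if len(parts) == 4:
        form, lemma, pos, rest = parts
        fields = rest.split()
        # Assumption: the last whitespace-separated item is the syntactic
        # function; everything before it is a semantic tag.
        deprel = fields[-1] if fields else None
        return Token(form, lemma, pos, fields[:-1], deprel)
    # Assumption: short lines such as '. PUNCT' carry only a form and one tag.
    form, _, tag = line.strip().rpartition(" ")
    return Token(form, None, None, [tag], None)


token = parse_token_line("on#olla#Verb#A3+ A1.1.1 M6 Z5 cop")
print(token.lemma, token.sem_tags, token.deprel)
```

Keeping all semantic tags in a list reflects the fact that a word may carry several candidate tags.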
The semantic tags are explained in the UCREL Semantic Analysis System (USAS) tagset document: https://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf
Tagging includes all the semantic tags available for the word, as FiST does not perform disambiguation. Words unknown to the tagger are marked with the tag Z99. Punctuation is tagged with PUNCT and numbers with NUMB. Lines beginning with # are UD2 output and contain document, paragraph and sentence information.
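The # lines can be used to group token lines into sentences. The following is a minimal sketch under stated assumptions: `iter_sentences` and `COMMENT_PREFIXES` are hypothetical names, and the set of comment keys (`newdoc`, `newpar`, `sent_id`, `text`) is taken only from the example output in this description. Matching on these known prefixes, rather than on any leading `#`, is a precaution in case a token line itself begins with `#`.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: these are the only comment lines that occur in the data.
COMMENT_PREFIXES = ("# newdoc", "# newpar", "# sent_id = ", "# text = ")


def iter_sentences(
    lines: Iterable[str],
) -> Iterator[Tuple[Dict[str, str], List[str]]]:
    """Yield (metadata, token_lines) pairs, one pair per sentence."""
    meta: Dict[str, str] = {}
    tokens: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(COMMENT_PREFIXES):
            if tokens:
                # A comment line after token lines opens the next sentence.
                yield meta, tokens
                meta, tokens = {}, []
            key, _, value = line[2:].partition(" = ")
            # Flags such as 'newdoc' have no value and are stored as ''.
            meta[key.strip()] = value
        else:
            tokens.append(line)
    if tokens:
        yield meta, tokens


sample = [
    "# newpar",
    "# sent_id = 2",
    "# text = Amsterdam on Alankomaiden pääkaupunki.",
    "Amsterdam#Amsterdam#Proper#Z2 nsubj:cop",
    "on#olla#Verb#A3+ A1.1.1 M6 Z5 cop",
    "Alankomaiden#Alankomaat#Proper#Z2 nmod:poss",
    "pääkaupunki#pääkaupunki#Noun#M7 root",
    ". PUNCT",
]
for meta, tokens in iter_sentences(sample):
    print(meta["sent_id"], len(tokens))
```

Because the function consumes any iterable of lines, it can be fed an open file object directly, so the 3.4 GB file never needs to be loaded into memory at once.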
The output contains 6 415 027 sentences and 98.81 million lines. The lexical coverage of the semantic tagging is 76.59 %.
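This description does not state exactly how the coverage figure is computed. One plausible reading, sketched here purely as an assumption, is the percentage of word tokens (excluding PUNCT and NUMB lines) whose semantic tags do not include Z99. The function name `lexical_coverage` is hypothetical, and the result is not guaranteed to reproduce the reported 76.59 %.

```python
from typing import Iterable

# Assumption: these are the only comment lines that occur in the data.
COMMENT_PREFIXES = ("# newdoc", "# newpar", "# sent_id = ", "# text = ")


def lexical_coverage(lines: Iterable[str]) -> float:
    """Percentage of word tokens whose semantic tags do not include Z99."""
    known = unknown = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(COMMENT_PREFIXES):
            continue
        parts = line.split("#", 3)
        if len(parts) < 4:
            # Assumption: lines without four '#'-separated fields are
            # PUNCT or NUMB lines and are not counted as words.
            continue
        fields = parts[3].split()
        sem_tags = fields[:-1]  # the last item is the syntactic function
        if "Z99" in sem_tags:
            unknown += 1
        else:
            known += 1
    total = known + unknown
    return 100.0 * known / total if total else 0.0


sample = [
    "# sent_id = 1",
    "# text = Amsterdam",
    "Amsterdam#Amsterdam#Proper#Z2 root",
]
print(f"{lexical_coverage(sample):.2f} %")
```

Other definitions, for example counting punctuation and numbers as covered tokens, would only require changing the branch that skips short lines.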
Examples of output
# newdoc
# newpar
# sent_id = 1
# text = Amsterdam
Amsterdam#Amsterdam#Proper#Z2 root
# newpar
# sent_id = 2
# text = Amsterdam on Alankomaiden pääkaupunki.
Amsterdam#Amsterdam#Proper#Z2 nsubj:cop
on#olla#Verb#A3+ A1.1.1 M6 Z5 cop
Alankomaiden#Alankomaat#Proper#Z2 nmod:poss
pääkaupunki#pääkaupunki#Noun#M7 root
. PUNCT
Files
(3.4 GB)
| Checksum | Size |
|---|---|
| md5:1401d33795b9c845666dfae74e9864c8 | 3.4 GB |