Published September 19, 2022 | Version v1
Dataset Open

Semantically tagged Finnish Wikipedia 2017

Authors/Creators

  • Kimmo Kettunen, University of Eastern Finland

Description

Description of FI Wikipedia 2017 tagging

Kimmo Kettunen

University of Eastern Finland

The tagged data contains the texts of the Finnish Wikipedia from 2017. The texts were first tagged syntactically in the Language Bank of Finland, using the UD2 tagger available in the Mylly service (https://mylly.rahtiapp.fi/home).

Semantic tags were then added to the UD2 parse using the lexical semantic tagger FiST (Kettunen, 2019, https://aclanthology.org/W19-0306/).

This published version has been condensed to a format where each analysed word contains the

1. original running word form,

2. lemma of the word form from UD2 parse,

3. part-of-speech of the word from FiST,

4. semantic tag(s) for the word from FiST, and

5. syntactic function of the word from UD2 parse.
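
The five fields listed above can be read back with a few lines of Python. The following is a minimal sketch, not part of the dataset tooling: it assumes that the first three fields are separated by `#`, that the semantic tags are separated by whitespace, and that the syntactic function is the last whitespace-separated item on the line, as in the sample lines of this record. The names `Token` and `parse_token_line` are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str            # 1. original running word form
    lemma: str           # 2. lemma from the UD2 parse
    pos: str             # 3. part-of-speech from FiST
    sem_tags: List[str]  # 4. semantic tag(s) from FiST
    function: str        # 5. syntactic function from the UD2 parse


def parse_token_line(line: str) -> Token:
    """Split one word line into its five fields.

    Assumption: the layout is form#lemma#pos#tags, followed by whitespace
    and the syntactic function. Punctuation lines such as ". PUNCT" do not
    follow this layout and raise ValueError here.
    """
    form, lemma, pos, rest = line.strip().split("#", 3)
    # Everything except the last item is a semantic tag.
    *sem_tags, function = rest.split()
    return Token(form, lemma, pos, sem_tags, function)


token = parse_token_line("on#olla#Verb#A3+ A1.1.1 M6 Z5 cop")
```

Word forms that themselves contain `#` would need special handling that this sketch does not attempt.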

Semantic tags used are explained in this UCREL Semantic Analysis System (USAS) document: https://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf

The tagging includes all semantic tags available for a word, as FiST does not perform disambiguation. Words unknown to the tagger are marked with the tag Z99. Punctuation is tagged with PUNCT and numbers with NUMB. Lines beginning with # are output of UD2 and contain document, paragraph and sentence information.
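
The line types described above can be told apart with a small classifier. This is a sketch under stated assumptions: the UD2 lines are recognised by the four prefixes visible in the sample output (`# newdoc`, `# newpar`, `# sent_id`, `# text`), and number lines are assumed to end in `NUMB` in the same way that punctuation lines end in `PUNCT`, which the sample does not confirm. The function name `classify_line` and its return labels are hypothetical.

```python
# Prefixes of the UD2 metadata lines seen in the sample output.
COMMENT_PREFIXES = ("# newdoc", "# newpar", "# sent_id", "# text")


def classify_line(line: str) -> str:
    """Return 'blank', 'comment', 'punct', 'number', 'unknown' or 'word'."""
    text = line.strip()
    if not text:
        return "blank"
    if text.startswith(COMMENT_PREFIXES):
        return "comment"
    last = text.split()[-1]
    if last == "PUNCT":
        return "punct"
    if last == "NUMB":  # assumed layout, by analogy with PUNCT
        return "number"
    parts = text.split("#", 3)
    # In a word line the last item is the syntactic function; the items
    # before it are semantic tags. Z99 marks a word unknown to the tagger.
    if len(parts) == 4 and "Z99" in parts[3].split()[:-1]:
        return "unknown"
    return "word"
```

Matching on the full prefixes rather than on a bare `#` keeps a punctuation line whose token is `#` from being mistaken for UD2 metadata.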

The output contains 6 415 027 sentences and 98.81 million lines. Lexical coverage of the semantic tagging is 76.59 %.
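
This record does not state how the coverage figure was computed. One plausible reading, sketched below purely as an assumption, is the percentage of word lines whose semantic tags do not include Z99, with UD2 metadata, punctuation and number lines left out of the count. The function name `lexical_coverage` is hypothetical, and the result of this sketch is not guaranteed to reproduce the published figure.

```python
from typing import Iterable

# Prefixes of the UD2 metadata lines seen in the sample output.
COMMENT_PREFIXES = ("# newdoc", "# newpar", "# sent_id", "# text")


def lexical_coverage(lines: Iterable[str]) -> float:
    """Percentage of word lines that carry no Z99 tag (assumed definition)."""
    words = 0
    known = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(COMMENT_PREFIXES):
            continue
        parts = line.split("#", 3)
        if len(parts) < 4:
            continue  # e.g. ". PUNCT": not a word line
        items = parts[3].split()
        if not items:
            continue
        sem_tags = items[:-1]  # the last item is the syntactic function
        words += 1
        if "Z99" not in sem_tags:
            known += 1
    return 100.0 * known / words if words else 0.0
```

Because the function consumes any iterable of lines, it can be fed an open file handle, so the 3.4 GB file does not have to be loaded into memory at once.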

Examples of output

# newdoc

# newpar

# sent_id = 1

# text = Amsterdam

Amsterdam#Amsterdam#Proper#Z2 root

# newpar

# sent_id = 2

# text = Amsterdam on Alankomaiden pääkaupunki.

Amsterdam#Amsterdam#Proper#Z2 nsubj:cop

on#olla#Verb#A3+ A1.1.1 M6 Z5 cop

Alankomaiden#Alankomaat#Proper#Z2 nmod:poss

pääkaupunki#pääkaupunki#Noun#M7 root

. PUNCT
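
The sample above can be grouped into sentences by following the `# sent_id` lines. The sketch below is one possible way to do this and is not the author's tooling: `Sentence` and `iter_sentences` are hypothetical names, blank lines are skipped, and each word or punctuation line is kept as a raw string.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Sentence:
    sent_id: str
    text: Optional[str] = None
    token_lines: List[str] = field(default_factory=list)


def iter_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one Sentence per '# sent_id' block."""
    current: Optional[Sentence] = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("# newdoc", "# newpar")):
            continue  # blank line or document/paragraph marker
        if line.startswith("# sent_id = "):
            if current is not None:
                yield current
            current = Sentence(sent_id=line[len("# sent_id = "):])
        elif current is None:
            continue  # ignore anything before the first sentence
        elif line.startswith("# text = ") and current.text is None:
            current.text = line[len("# text = "):]
        else:
            current.token_lines.append(line)
    if current is not None:
        yield current


# The sample output from this record, without the blank lines.
SAMPLE = """\
# newdoc
# newpar
# sent_id = 1
# text = Amsterdam
Amsterdam#Amsterdam#Proper#Z2 root
# newpar
# sent_id = 2
# text = Amsterdam on Alankomaiden pääkaupunki.
Amsterdam#Amsterdam#Proper#Z2 nsubj:cop
on#olla#Verb#A3+ A1.1.1 M6 Z5 cop
Alankomaiden#Alankomaat#Proper#Z2 nmod:poss
pääkaupunki#pääkaupunki#Noun#M7 root
. PUNCT
"""

sentences = list(iter_sentences(SAMPLE.splitlines()))
```

On this sample the generator yields two sentences, the second holding five token lines including the final `. PUNCT` line.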

Files

Files (3.4 GB)

md5:1401d33795b9c845666dfae74e9864c8
3.4 GB