Published April 20, 2023 | Version v1.0
Dataset Open

OcWikiAnnot: Annotated Wikipedia Corpus of Occitan

  • 1. University of Helsinki

Description

OcWikiAnnot is a corpus of Occitan Wikipedia content that has been tokenized, part-of-speech (PoS) tagged, and lemmatized. The corpus contains 100 000 sentences for a total of 2 037 723 tokens. It is based on the Occitan Wikipedia corpus that is part of the Leipzig Corpora Collection.

 

 

Files

OcWikiAnnot_1.0.zip (16.7 MB)
md5:c9735154deb820f4de234e795dd29788
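
The md5 value listed for OcWikiAnnot_1.0.zip can be used to check that a download arrived intact. The following is a minimal sketch using Python's standard hashlib module; the local file path and the helper names (md5_of_file, matches_published_checksum) are assumptions made for illustration and are not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum as published on this record for OcWikiAnnot_1.0.zip.
EXPECTED_MD5 = "c9735154deb820f4de234e795dd29788"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the archive saved under its original name in the working directory.
    archive = Path("OcWikiAnnot_1.0.zip")
    if archive.exists():
        print("checksum OK" if matches_published_checksum(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.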

Additional details

Funding

CorCoDial - Corpus-based computational dialectology: exploiting machine translation techniques to extract, visualize and interpret dialectal patterns (grant 342859)
Academy of Finland

References

  • Aleksandra Miletić and Janine Siewert. 2023. Lemmatization Experiments on Two Low-Resourced Languages: Occitan and Low Saxon. In Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (to appear). Association for Computational Linguistics.