Published April 20, 2023 | Version v1.0
Dataset Open

OcWikiAnnot: Annotated Wikipedia Corpus of Occitan

  • 1. University of Helsinki


OcWikiAnnot is a corpus of Wikipedia content in Occitan that is tokenized, PoS-tagged and lemmatized. The corpus contains 100 000 sentences for a total of 2 037 723 tokens. It is based on the Wikipedia corpus in Occitan that is part of the Leipzig Corpora Collection.
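The reported figures imply an average sentence length of roughly 20 tokens. The sketch below shows how sentence and token counts could be computed from an annotated file. The file layout is an assumption, since the record does not describe it: one token per line with tab-separated token, PoS tag and lemma columns, and a blank line between sentences. The sample text, the tag names and the `read_sentences` helper are invented for illustration and are not taken from the corpus.

```python
# Invented sample in an ASSUMED three-column layout (token, PoS tag, lemma).
# Neither the layout nor the tags are confirmed by the dataset record.
SAMPLE = (
    "Lo\tDET\tlo\n"
    "gat\tNOUN\tgat\n"
    "dormís\tVERB\tdormir\n"
    "\n"
    "Plòu\tVERB\tplòure\n"
)


def read_sentences(text):
    """Split annotated text into sentences of (token, pos, lemma) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence (assumed convention).
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, lemma = line.split("\t")
        current.append((token, pos, lemma))
    if current:
        sentences.append(current)
    return sentences


sentences = read_sentences(SAMPLE)
n_tokens = sum(len(sentence) for sentence in sentences)

# Figures reported in the corpus description.
REPORTED_SENTENCES = 100_000
REPORTED_TOKENS = 2_037_723
reported_avg = REPORTED_TOKENS / REPORTED_SENTENCES

print(f"sample: {len(sentences)} sentences, {n_tokens} tokens")
print(f"reported average sentence length: {reported_avg:.2f} tokens")
```

If the distributed file uses a different layout, only the line-splitting logic in `read_sentences` would need to change.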




Files (16.7 MB)


Additional details


Academy of Finland, grant 342859: CorCoDial - Corpus-based computational dialectology: exploiting machine translation techniques to extract, visualize and interpret dialectal patterns


  • Aleksandra Miletić and Janine Siewert. 2023. Lemmatization Experiments on Two Low-Resourced Languages: Occitan and Low Saxon. In Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (to appear). Association for Computational Linguistics.