Published April 20, 2023 | Version v1.0
Dataset
Open
OcWikiAnnot: Annotated Wikipedia Corpus of Occitan
Description
OcWikiAnnot is a corpus of Wikipedia content in Occitan that is tokenized, PoS-tagged and lemmatized. The corpus contains 100 000 sentences for a total of 2 037 723 tokens. It is based on the Wikipedia corpus in Occitan that is part of the Leipzig Corpora Collection.
Files
Name | Size |
---|---|
OcWikiAnnot_1.0.zip (md5:c9735154deb820f4de234e795dd29788) | 16.7 MB |
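The MD5 checksum listed for the archive can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch, assuming the archive was saved in the current directory under its listed name; the function names `md5_of_file` and `archive_is_intact` are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum as listed in the Files table of this record.
EXPECTED_MD5 = "c9735154deb820f4de234e795dd29788"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path: the archive saved under its listed file name.
    archive = Path("OcWikiAnnot_1.0.zip")
    if archive.exists():
        print("checksum OK" if archive_is_intact(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an interrupted or corrupted download, in which case downloading the archive again is the simplest fix.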
Additional details
Funding
- Academy of Finland: CorCoDial - Corpus-based computational dialectology: exploiting machine translation techniques to extract, visualize and interpret dialectal patterns (grant 342859)
References
- Aleksandra Miletić and Janine Siewert. 2023. Lemmatization Experiments on Two Low-Resourced Languages: Occitan and Low Saxon. In Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (to appear). Association for Computational Linguistics.