OcWikiAnnot: Annotated Wikipedia Corpus of Occitan

Miletic, Aleksandra

doi:10.5281/zenodo.7777340

Published April 20, 2023 | Version v1.0

Dataset Open

Miletic, Aleksandra¹

1. University of Helsinki

OcWikiAnnot is a corpus of Wikipedia content in Occitan that is tokenized, PoS-tagged and lemmatized. The corpus contains 100 000 sentences for a total of 2 037 723 tokens. It is based on the Wikipedia corpus in Occitan that is part of the Leipzig Corpora Collection.
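The two figures above imply an average sentence length of roughly 20 tokens. The short calculation below only restates that arithmetic; the constant and function names are illustrative and are not part of the dataset.

```python
# Figures quoted in the corpus description.
SENTENCES = 100_000
TOKENS = 2_037_723


def mean_tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Return the average number of tokens per sentence."""
    return tokens / sentences


if __name__ == "__main__":
    # Prints the average rounded to two decimals.
    print(round(mean_tokens_per_sentence(TOKENS, SENTENCES), 2))
```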

Files

Name	Size
OcWikiAnnot_1.0.zip md5:c9735154deb820f4de234e795dd29788	16.7 MB
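The MD5 checksum listed for the archive can be used to check that a download is intact. The following is a minimal sketch, assuming the archive has been saved under its original name in the current directory; the function names `md5_of_file` and `verify_archive` are hypothetical helpers, not something shipped with the corpus.

```python
import hashlib
from pathlib import Path

# Checksum as published on the record page for OcWikiAnnot_1.0.zip.
EXPECTED_MD5 = "c9735154deb820f4de234e795dd29788"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local file name; adjust if the archive was saved elsewhere.
    archive = Path("OcWikiAnnot_1.0.zip")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use constant regardless of file size, although for a 16.7 MB archive reading the whole file at once would also work.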

Additional details

Funding

Research Council of Finland
CorCoDial - Corpus-based computational dialectology: exploiting machine translation techniques to extract, visualize and interpret dialectal patterns (grant 342859)

References

Aleksandra Miletić and Janine Siewert. 2023. Lemmatization Experiments on Two Low-Resourced Languages: Occitan and Low Saxon. In Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (to appear). Association for Computational Linguistics.

DOI: 10.5281/zenodo.7777340
Resource type: Dataset
Publisher: Zenodo
Languages: Occitan (post 1500)

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: April 20, 2023
Modified: April 20, 2023