Published June 24, 2024 | Version v2
Software
Open
Tokenization for Occitan (Gascon and Lengadocian)
- 1. Université de Poitiers
- 2. University of Helsinki
Description
A rule-based Python programme for tokenising Occitan texts.
To launch the programme, run the following command:
python3 tokenizer_occitan.py < input.txt > output.conllu
The script takes as input a text file with one sentence per line. Each line starts with a sentence ID, followed by a tab character, followed by the sentence itself.
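The input format described above can be produced with a few lines of Python. The following is a minimal sketch, not part of the distributed tool: the helper names `format_input_lines` and `write_input_file`, the `sent-N` ID scheme, and the sample sentences are all illustrative assumptions. The only requirement taken from the description is one sentence per line, with the ID and the sentence separated by a tab.

```python
def format_input_lines(sentences, prefix="sent"):
    """Return one '<ID><TAB><sentence>' line per sentence.

    The ID scheme (prefix-N) is an assumption made for this sketch;
    the description only requires an ID, a tab, then the sentence.
    """
    lines = []
    for index, sentence in enumerate(sentences, start=1):
        # Collapse runs of whitespace (including stray tabs and newlines)
        # so that each sentence stays on one line with a single tab separator.
        clean = " ".join(sentence.split())
        lines.append(f"{prefix}-{index}\t{clean}")
    return lines


def write_input_file(sentences, path="input.txt"):
    """Write the formatted lines to a UTF-8 text file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(format_input_lines(sentences)) + "\n")


# Illustrative sentences only; replace them with your own data.
example_lines = format_input_lines(["L'ostal es grand.", "Bonjorn a totes."])
```

A file written by `write_input_file` can then be passed to the tokeniser on standard input, as in the command above.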
The current version of the tool was developed during the projects DIVITAL (funded by the ANR) and CorCoDial (funded by the Academy of Finland).
Files
(18.6 kB)
Name | Size |
---|---|
md5:16565cd9cba138794692e34289da8946 | 16.8 kB |
md5:522d0cf47b5dffc00ac5b27cc2168a6e | 1.9 kB |
Additional details
Funding
- DIVITAL – Increase the DIgital VITALity and visibility of languages of France: linguistic descriptions and annotated corpora ANR-21-CE27-0004
- Agence Nationale de la Recherche
- CorCoDial - Corpus-based computational dialectology: exploiting machine translation techniques to extract, visualize and interpret dialectal patterns 342859
- Research Council of Finland
- RESTAURE – Computational Resources and Processing for Regional Languages ANR-14-CE24-0003
- Agence Nationale de la Recherche
References
- Bernhard, D., Ligozat, A.-L., Bras, M., Martin, F., Vergez-Couret, M., Erhart, P., Sibille, J., Todirascu, A., Boula de Mareüil, P., Huck, D. (2021). Collecting and annotating corpora for three under-resourced languages of France: Methodological issues. Language Documentation & Conservation, 15, 285-326.
- Miletić, A. (2023). Outiller l'occitan : nouvelles ressources et lemmatisation. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs, pages 217–231, Paris, France. ATALA.