2024-03-29T00:15:28Z
https://zenodo.org/oai2d
oai:zenodo.org:1493642
2020-01-25T07:23:48Z
user-restaure
software
Todirascu Amalia
2018-11-21
<p>This software tokenises Picard texts, i.e. it splits sentences into words and punctuation marks. The tokeniser handles ambiguous separators such as the dash, the apostrophe and the dot.</p>
<p>The software is written in Perl 5.22.1. Instructions for installing and running it are given in the script file.</p>
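<p>As a rough illustration of how rule-based handling of the dash, the apostrophe and the dot can work, here is a minimal Python sketch. It is not the Perl script distributed with this record: the regular expression, the rule set and the function name <code>tokenise</code> are assumptions made for the example, and the real tokeniser's rules may differ.</p>

```python
import re

# Hypothetical rules, for illustration only (not the rules of the Perl script).
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,]\d+)*     # numbers: keep internal dots and commas together
  | \w+'                # elided forms: the apostrophe stays with the left part
  | \w+(?:-\w+)*        # words: internal dashes are kept inside the token
  | [^\w\s]             # any other punctuation sign is a token of its own
    """,
    re.VERBOSE,
)


def tokenise(sentence):
    """Split a sentence into word and punctuation tokens."""
    # Map the typographic apostrophe to the ASCII one before matching.
    return TOKEN_RE.findall(sentence.replace("\u2019", "'"))


print(tokenise("l'bas 3.5 porte-monnaie."))
```

<p>The order of the alternatives matters: the number rule is tried first so that a dot inside a number is not split off, and the catch-all punctuation rule comes last.</p>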
https://doi.org/10.5281/zenodo.1493642
oai:zenodo.org:1493642
pcd
Zenodo
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1493641
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
DiLiTAL 2017, Actes de l'atelier « Diversité Linguistique et TAL », Orléans, France, Jun 2017
tokenisation, rule-based method
Tokeniser for Picard
info:eu-repo/semantics/other
oai:zenodo.org:1171925
2023-02-15T07:59:37Z
user-restaure
Bernhard, Delphine
Erhart, Pascale
Huck, Dominique
Steiblé, Lucie
2018-02-09
<p>These guidelines were produced in the context of the RESTAURE project, funded by the French ANR. They were used to annotate the following corpus: <a href="https://doi.org/10.5281/zenodo.1170128">10.5281/zenodo.1170128</a></p>
<p>They detail the tags along with examples, and provide answers to some specific linguistic issues.</p>
<p>The guidelines are written in French.</p>
https://doi.org/10.5281/zenodo.1171925
oai:zenodo.org:1171925
fra
Zenodo
https://doi.org/10.5281/zenodo.1170128
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1171924
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Linguistics
Annotation
Alsatian
Part-of-speech
Lemma
Part-of-Speech Annotation Guidelines for the Alsatian Dialects
info:eu-repo/semantics/technicalDocumentation
oai:zenodo.org:1484520
2020-01-24T19:24:53Z
user-restaure
openaire_data
Ligozat, Anne-Laure
2018-11-12
<p>OpenNLP tokenization model for Picard, trained on the Restaure corpus.</p>
<p>The apostrophes must be standardized in the input file: l’bas -> l'bas</p>
<p>To tokenize a file: <code>&lt;input-file.txt opennlp TokenizerME pcd-token.bin</code></p>
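<p>The apostrophe standardisation required above can be done in a small preprocessing step. The Python sketch below is one possible way to do it. Only the replacement of ’ by ' comes from the record; the two other apostrophe-like characters and the function name <code>standardise_apostrophes</code> are assumptions.</p>

```python
# U+2019 is the character named in the record; U+2018 and U+02BC are
# additional look-alikes included here as an assumption.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc"

_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def standardise_apostrophes(text):
    """Replace typographic apostrophes with the ASCII apostrophe."""
    return text.translate(_TABLE)


print(standardise_apostrophes("l\u2019bas"))
```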
https://doi.org/10.5281/zenodo.1484520
oai:zenodo.org:1484520
pcd
Zenodo
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1484519
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
tokenization
natural language processing
OpenNLP tokenization model for Picard
info:eu-repo/semantics/other
oai:zenodo.org:1174214
2020-01-24T19:25:01Z
user-restaure
openaire_data
Bernhard, Delphine
Steiblé, Lucie
2018-02-16
<p>This dataset contains a collection of pronunciation dictionaries which were manually transcribed using the X-SAMPA transcription system. The transcriptions are based on audio recordings available on the following websites:</p>
<ul>
<li>OLCA: <a href="http://www.olcalsace.org/fr/lexiques">http://www.olcalsace.org/fr/lexiques</a></li>
<li>Elsässich Web diktionnair: <a href="http://www.ami-hebdo.com/elsadico/index.php">http://www.ami-hebdo.com/elsadico/index.php</a></li>
</ul>
<p>The dataset was produced in the context of the RESTAURE project, funded by the French ANR. The transcription process is described in the following research report: 10.5281/zenodo.1174219. It is also detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704814">http://hal.archives-ouvertes.fr/hal-01704814</a></p>
<p>Three pronunciation dictionaries are available:</p>
<ul>
<li>elsassich_dico.csv: 702 transcriptions from the "Elsässich Web diktionnair"</li>
<li>olca67.csv: 1,458 transcriptions from the "OLCA" lexicons for the northern part of the Alsace region (Bas-Rhin)</li>
<li>olca68.csv: 1,401 transcriptions from the "OLCA" lexicons for the southern part of the Alsace region (Haut-Rhin)</li>
</ul>
https://doi.org/10.5281/zenodo.1174214
oai:zenodo.org:1174214
Zenodo
http://hal.archives-ouvertes.fr/hal-01704814
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1174213
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Pronunciation dictionary
Linguistics
Alsatian Dialects
Pronunciation Dictionaries for the Alsatian Dialects
info:eu-repo/semantics/other
oai:zenodo.org:1484398
2020-01-24T19:25:54Z
user-restaure
openaire_data
Rosset, Sophie
Lavergne, Thomas
Magistry, Pierre
Ligozat, Anne-Laure
Martin, Fanny
Rey, Christophe
Reynés, Philippe
2018-02-13
<p>This corpus contains a collection of texts in Picard which were manually annotated with parts-of-speech, lemmas, translations into French and location entities. The corpus was produced in the context of the RESTAURE project, funded by the French ANR. The current version of the corpus contains 25 documents. The annotation process is detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704806">http://hal.archives-ouvertes.fr/hal-01704806</a></p>
<p>The untokenised and unannotated versions of the documents are found in the “extraits_reference_bruts” folder when available. The annotated versions of the documents are found in the “picud” folder. They are provided in the <a href="http://universaldependencies.org/format.html">CoNLL-U format</a>. Additional information is also given:</p>
<ul>
<li>(inflected) translation into French</li>
<li>4 features representing the annotation of location names (<a href="http://www.quaero.org/media/files/bibliographie/quaero-guide-annotation-2011.pdf">Quaero categories</a>)</li>
<li>2 features indicating whether the token is part of a term (a compound noun or set phrase, for example)</li>
</ul>
<p>If the lemma or French translation for a token is the same as the x<sup>th</sup> token before, it is annotated with "IDEM-x".</p>
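<p>The "IDEM-x" convention can be expanded mechanically when loading the corpus. The Python sketch below shows one way to do so, assuming that "IDEM-1" points to the immediately preceding token and "IDEM-2" to the one before it. That reading of "x<sup>th</sup> token before", and the function name <code>resolve_idem</code>, are assumptions.</p>

```python
import re

IDEM_RE = re.compile(r"^IDEM-(\d+)$")


def resolve_idem(values):
    """Replace each 'IDEM-x' entry by the value found x tokens earlier.

    `values` is the list of lemmas (or translations) of one document,
    in token order.
    """
    resolved = []
    for index, value in enumerate(values):
        match = IDEM_RE.match(value)
        if match is None:
            resolved.append(value)
            continue
        target = index - int(match.group(1))
        if target < 0:
            raise ValueError(f"{value} at position {index} points before the start")
        # Look up the already resolved list so that chains of IDEM work too.
        resolved.append(resolved[target])
    return resolved


# Placeholder values, not taken from the corpus.
print(resolve_idem(["maison", "IDEM-1", "chat", "IDEM-2"]))
```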
<p>The csv file "liste_textes_distribues" contains details for each text: author, title of the book, publishing year, code, genre and linguistic area.</p>
https://doi.org/10.5281/zenodo.1484398
oai:zenodo.org:1484398
pcd
Zenodo
http://hal.archives-ouvertes.fr/hal-01704806
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1172575
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Linguistics
Natural Language Processing
Picard
Part-of-speech
Lemma
Annotated Corpus for Picard
info:eu-repo/semantics/other
oai:zenodo.org:1172576
2020-01-24T19:25:54Z
user-restaure
openaire_data
Rosset, Sophie
Lavergne, Thomas
Magistry, Pierre
Ligozat, Anne-Laure
Martin, Fanny
Rey, Christophe
Reynés, Philippe
2018-02-13
<p>This corpus contains a collection of texts in Picard which were manually annotated with parts-of-speech, lemmas, translations into French and location entities. The corpus was produced in the context of the RESTAURE project, funded by the French ANR. The current version of the corpus contains 25 documents. The annotation process is detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704806">http://hal.archives-ouvertes.fr/hal-01704806</a></p>
<p>The untokenised and unannotated versions of the documents are found in the “extraits_reference_bruts” folder when available. The annotated versions of the documents are found in the “extraits_reference_annotes” folder. They are provided in a CSV format with the following columns:</p>
<ul>
<li>word form</li>
<li>part-of-speech</li>
<li>word lemma</li>
<li>translation into French</li>
<li>4 columns representing the annotation of location names (Quaero categories)</li>
<li>2 columns indicating whether the token is part of a term (a compound noun or set phrase, for example)</li>
</ul>
<p>If the lemma or French translation for a token is the same as the x<sup>th</sup> token before, it is annotated with "IDEM-x".</p>
<p>The csv file "liste_textes_distribues" contains details for each text: author, title of the book, publishing year, code, genre and linguistic area.</p>
https://doi.org/10.5281/zenodo.1172576
oai:zenodo.org:1172576
pcd
Zenodo
http://hal.archives-ouvertes.fr/hal-01704806
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1172575
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Linguistics
Natural Language Processing
Picard
Part-of-speech
Lemma
Annotated Corpus for Picard
info:eu-repo/semantics/other
oai:zenodo.org:1485988
2020-01-24T19:25:53Z
user-restaure
openaire_data
Rosset, Sophie
Lavergne, Thomas
Magistry, Pierre
Ligozat, Anne-Laure
Martin, Fanny
Rey, Christophe
Reynés, Philippe
2018-02-13
<p>This corpus contains a collection of texts in Picard which were manually annotated with parts-of-speech, lemmas, translations into French and location entities. The corpus was produced in the context of the RESTAURE project, funded by the French ANR. The current version of the corpus contains 25 documents. The annotation process is detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704806">http://hal.archives-ouvertes.fr/hal-01704806</a></p>
<p>The untokenised and unannotated versions of the documents are found in the “extraits_reference_bruts” folder when available. The annotated versions of the documents are found in the “extraits_reference_annotes” folder (original CSV file) and the “picud” folder (<a href="http://universaldependencies.org/format.html">CoNLL-U format</a>). Additional information is also given:</p>
<ul>
<li>(inflected) translation into French</li>
<li>4 features representing the annotation of location names (<a href="http://www.quaero.org/media/files/bibliographie/quaero-guide-annotation-2011.pdf">Quaero categories</a>)</li>
<li>2 features indicating whether the token is part of a term (a compound noun or set phrase, for example)</li>
</ul>
<p>If the lemma or French translation for a token is the same as the x<sup>th</sup> token before, it is annotated with "IDEM-x".</p>
<p>The csv file "liste_textes_distribues" contains details for each text: author, title of the book, publishing year, code, genre and linguistic area.</p>
https://doi.org/10.5281/zenodo.1485988
oai:zenodo.org:1485988
pcd
Zenodo
http://hal.archives-ouvertes.fr/hal-01704806
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1172575
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Linguistics
Natural Language Processing
Picard
Part-of-speech
Lemma
Annotated Corpus for Picard
info:eu-repo/semantics/other
oai:zenodo.org:2533877
2019-01-08T10:49:17Z
user-restaure
Marianne Vergez-Couret
2019-01-08
<p>This dataset provides trained Tesseract (<a href="https://github.com/tesseract-ocr/tesseract">https://github.com/tesseract-ocr/tesseract</a>) and Jochre (<a href="https://github.com/urieli/jochre">https://github.com/urieli/jochre</a>) OCR models for Occitan (for the standard spelling and two dialects, Gascon and Lengadocian). These models were developed in the context of the RESTAURE project, funded by the French ANR.</p>
<p>Two models are provided. They were presented in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01252241">https://hal.archives-ouvertes.fr/hal-01252241</a>. They were also re-evaluated for the creation of another corpus in <a href="https://www.openscience.fr/Constitution-et-annotation-d-un-corpus-ecrit-de-contes-et-recits-en-occitan">https://www.openscience.fr/Constitution-et-annotation-d-un-corpus-ecrit-de-contes-et-recits-en-occitan</a>.</p>
<p>The first model, JOCHRE_2015, was trained for Jochre 1.1.2b. The training images and corresponding texts were manually annotated using a Jochre online platform (excerpts from 7 different printed works, totalling about 20,000 words).</p>
<p>The second model, TESS_2015, was trained for Tesseract using the jTessBoxEditor tool (<a href="http://vietocr.sourceforge.net/training.html">http://vietocr.sourceforge.net/training.html</a>), Version 1.4 (2 May 2015), based on images automatically generated from the training texts (the same texts as those used for Jochre). The images were generated with a 36pt font size and two fonts (Arial and Times New Roman), in their normal and italic variants. The Tesseract model can be used with Tesseract 3.0x.</p>
<p>A word list was also used for both training runs. We merged Occitan words found in several lexicons, dictionaries and corpora for the two dialects, Gascon and Lengadocian:</p>
<ul>
<li>Lexicon extracted from 60 literary works (from 29 different authors) gathered in the BaTelÒc project.</li>
<li>Dictionary entries from <em>Dictionnaire Français/Occitan Gascon Toulousain</em> by Nicolau Rei Bèthvéder, 2004, IEO Edicions</li>
<li>Dictionary entries from <em>Dictionnaire Français/Occitan</em> by Cristian Laus, 2004, IEO/IDECO</li>
<li>Dictionary entries from <em>Dictionnaire Français/Occitan (Gascon)</em> by Miquèu Grosclaude, Gilabèrt Nariòo and Patric Guilhemjoan, 2007, Per Noste Edicions</li>
<li>Conjugated forms from Verb’Òc (designed by the <em>Congrès permanent de la lenga occitana</em> (<a href="http://www.locongres.org">http://www.locongres.org</a>))</li>
<li>List of proper nouns extracted from the Apertium (free/open-source machine translation platform) Occitan lexicon.</li>
</ul>
<p>The Jochre model can be used with the Jochre software (<a href="https://github.com/urieli/jochre">https://github.com/urieli/jochre</a>). See also the Jochre wiki (https://github.com/urieli/jochre/wiki).</p>
<p>The Tesseract models can be used, for instance, with the gImageReader tool (<a href="https://github.com/manisandro/gImageReader">https://github.com/manisandro/gImageReader</a>), which provides a graphical user interface for Tesseract.</p>
<p>When evaluated against the same test corpus (four extracts from four different authors from two dialects, Gascon and Lengadocian), the Jochre model achieves better performance levels.</p>
https://doi.org/10.5281/zenodo.2533877
oai:zenodo.org:2533877
Zenodo
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.2533876
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
OCR module, Tesseract, Jochre, Occitan, OCR
OCR models for Occitan (standard spelling)
info:eu-repo/semantics/other
oai:zenodo.org:1173113
2020-01-20T16:31:54Z
user-restaure
Bras Myriam
2018-02-14
<p>These guidelines were produced in the context of the RESTAURE project, funded by the French ANR.</p>
<p>They detail the tags along with examples, and provide answers to some specific linguistic issues.</p>
<p>The guidelines are written in Occitan and French.</p>
https://doi.org/10.5281/zenodo.1173113
oai:zenodo.org:1173113
oci
Zenodo
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1173112
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Annotation
Part Of Speech
Linguistics
Lemma
Occitan
Part Of Speech Annotation Guidelines for the Occitan Language
info:eu-repo/semantics/technicalDocumentation
oai:zenodo.org:2454993
2023-02-15T08:00:24Z
user-restaure
software
Bernhard, Delphine
2018-12-20
<p>A Python module to tokenise texts in the Alsatian dialects. See the module header for help on how to use the tokeniser.</p>
<p>The module requires Python 2.7.</p>
<p>This tool was developed in the context of the RESTAURE project, funded by the French ANR. The tokeniser is also described in the following article: https://hal.archives-ouvertes.fr/hal-01539160.</p>
<p>Version 1.4.1 fixes a bug occurring when the space is missing after a comma.</p>
https://doi.org/10.5281/zenodo.2454993
oai:zenodo.org:2454993
Zenodo
https://hal.archives-ouvertes.fr/hal-01539160
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1404896
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Natural Language Processing
Alsatian
Tokenisation
Tokeniser for the Alsatian Dialects
info:eu-repo/semantics/other
oai:zenodo.org:1170129
2021-02-19T10:32:23Z
user-restaure
openaire_data
Dorffer, Clément
Bernhard, Delphine
Erhart, Pascale
Huck, Dominique
Steiblé, Lucie
2018-02-09
<p>This corpus contains a collection of texts in the Alsatian dialects which were manually annotated with parts-of-speech, lemmas, translations into French and location entities.</p>
<p>The corpus was produced in the context of the RESTAURE project, funded by the French ANR. The current version of the corpus contains 21 documents and 12,570 tokens. The annotation process is detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704806">http://hal.archives-ouvertes.fr/hal-01704806</a></p>
<p>The untokenised and unannotated versions of the documents are found in the “txt” folder. The annotated versions of the documents are found in the “annotated” folder. They are provided in a TSV format with the following columns:</p>
<ul>
<li>id: token index in the document</li>
<li>form: word form</li>
<li>translation: translation into French</li>
<li>lemma: word lemma</li>
<li>pos: part-of-speech</li>
<li>location: Begin-Inside tags for location entities</li>
</ul>
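<p>A file with these six columns can be loaded with Python's standard <code>csv</code> module. The sketch below assumes that the columns appear in the order listed above and that the files have no header row. Neither point is stated in the record, and the sample row is invented for the example.</p>

```python
import csv
import io

# Column names as listed in the record; their order in the files is assumed.
COLUMNS = ["id", "form", "translation", "lemma", "pos", "location"]


def read_annotated_tsv(handle):
    """Return one dict per token from an open tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Invented sample row; the tag values are placeholders.
sample = io.StringIO("1\tStrossburi\tStrasbourg\tStrossburi\tPROPN\tB\n\n")
tokens = read_annotated_tsv(sample)
print(tokens[0]["form"], tokens[0]["translation"])
```

<p>If the files do start with a header row, it can be dropped with <code>rows[1:]</code> or read with <code>csv.DictReader</code> instead.</p>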
https://doi.org/10.5281/zenodo.1170129
oai:zenodo.org:1170129
Zenodo
http://hal.archives-ouvertes.fr/hal-01704806
https://doi.org/10.5281/zenodo.1171925
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1170128
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Linguistics
Natural Language Processing
Alsatian
Part-of-speech
Lemma
Annotated Corpus for the Alsatian Dialects
info:eu-repo/semantics/other
oai:zenodo.org:2536041
2021-02-19T10:32:25Z
user-restaure
openaire_data
Dorffer, Clément
Bernhard, Delphine
Erhart, Pascale
Huck, Dominique
Steiblé, Lucie
2019-01-09
<p>This corpus contains a collection of texts in the Alsatian dialects which were manually annotated with parts-of-speech, lemmas, translations into French and location entities.</p>
<p>The corpus was produced in the context of the RESTAURE project, funded by the French ANR. The current version of the corpus contains 21 documents and 12,570 tokens. The annotation process is detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704806">http://hal.archives-ouvertes.fr/hal-01704806</a></p>
<p><strong>Information about version 2</strong></p>
<p>Version 2 contains the same annotated documents as version 1, but some errors have been corrected and the annotated corpus is provided in the <a href="http://universaldependencies.org/format.html">CoNLL-U format</a>.</p>
<p>The untokenised and unannotated versions of the documents are found in the “txt” folder. The annotated versions of the documents are found in the "ud" folder (<a href="http://universaldependencies.org/format.html">CoNLL-U format</a>).</p>
<p>In addition to the form, the lemma and the part-of-speech, additional information is also provided:</p>
<ul>
<li>translation of the lemma into French (Gloss field)</li>
<li>annotation of location names (NamedType field)</li>
</ul>
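<p>The Gloss and NamedType fields can be read alongside the standard CoNLL-U columns. The Python sketch below assumes that both are stored as <code>key=value</code> pairs in the tenth (MISC) column, which is the usual place for extra attributes in CoNLL-U but is not confirmed by the record. The sample line is invented.</p>

```python
def parse_misc(misc):
    """Turn a 'Key=Value|Key=Value' string into a dict ('_' means empty)."""
    if not misc or misc == "_":
        return {}
    pairs = (item.split("=", 1) for item in misc.split("|") if "=" in item)
    return dict(pairs)


def read_conllu_tokens(lines):
    """Yield form, lemma, UPOS, gloss and location type for each token line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed CoNLL-U token line
        misc = parse_misc(columns[9])
        yield {
            "id": columns[0],
            "form": columns[1],
            "lemma": columns[2],
            "upos": columns[3],
            "gloss": misc.get("Gloss"),            # assumed to live in MISC
            "named_type": misc.get("NamedType"),   # assumed to live in MISC
        }


# Invented sample line, not taken from the corpus.
sample = ["# sent_id = 1\n",
          "1\tStrossburi\tStrossburi\tPROPN\t_\t_\t_\t_\t_\tGloss=Strasbourg|NamedType=loc.adm.town\n"]
for token in read_conllu_tokens(sample):
    print(token["form"], token["gloss"], token["named_type"])
```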
https://doi.org/10.5281/zenodo.2536041
oai:zenodo.org:2536041
Zenodo
http://hal.archives-ouvertes.fr/hal-01704806
https://doi.org/10.5281/zenodo.1171925
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1170128
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Linguistics
Natural Language Processing
Alsatian
Part-of-speech
Lemma
Annotated Corpus for the Alsatian Dialects
info:eu-repo/semantics/other
oai:zenodo.org:2533873
2020-01-24T19:27:15Z
user-restaure
software
Marianne Vergez-Couret
2019-01-08
<p>A Perl programme to tokenise texts in Occitan.</p>
<p>The programme is adapted from the extended version (that is, the version with a list of exceptions) of the Perl programme for tokenising French texts by Tanguy and Hathout (2007).</p>
<p>To launch the programme, run the following command:</p>
<p><strong>perl</strong> segmenteur_occitan.pl exceptions_occitan.txt &lt;input &gt;output</p>
<p>This tool was developed in the context of the RESTAURE project, funded by the French ANR.</p>
https://doi.org/10.5281/zenodo.2533873
oai:zenodo.org:2533873
Zenodo
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.2533872
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
tokenization, occitan
Tokenization for Occitan (Gascon and Lengadocian)
info:eu-repo/semantics/other
oai:zenodo.org:1296285
2020-01-20T14:24:45Z
user-restaure
Bras, Myriam
Esher, Louise
Sibille, Jean
Vergez-Couret, Marianne
2018-02-22
<p>This is the description of a corpus of texts in several dialects of Occitan (lengadocian, gascon, provençau, vivaro-aupenc, auvernhàs, lemosin), manually annotated with parts-of-speech and lemmas. The corpus is available at:</p>
<p>DOI:10.5281/zenodo.1182949.</p>
https://doi.org/10.5281/zenodo.1296285
oai:zenodo.org:1296285
oci
Zenodo
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1182932
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
corpus description
POS
occitan
Annotated Corpus for Occitan : Corpus Description
info:eu-repo/semantics/technicalDocumentation
oai:zenodo.org:1484393
2020-01-24T19:26:01Z
user-restaure
openaire_data
Rosset, Sophie
Lavergne, Thomas
Magistry, Pierre
Ligozat, Anne-Laure
Martin, Fanny
Rey, Christophe
Reynés, Philippe
2018-02-13
<p>This corpus contains a collection of texts in Picard which were manually annotated with parts-of-speech, lemmas, translations into French and location entities. The corpus was produced in the context of the RESTAURE project, funded by the French ANR. The current version of the corpus contains 25 documents. The annotation process is detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704806">http://hal.archives-ouvertes.fr/hal-01704806</a></p>
<p>The untokenised and unannotated versions of the documents are found in the “extraits_reference_bruts” folder when available. The annotated versions of the documents are found in the “extraits_reference_annotes” folder. They are provided in a CSV format with the following columns:</p>
<ul>
<li>word form</li>
<li>part-of-speech</li>
<li>word lemma</li>
<li>translation into French</li>
<li>4 columns representing the annotation of location names (Quaero categories)</li>
<li>2 columns indicating whether the token is part of a term (a compound noun or set phrase, for example)</li>
</ul>
<p>If the lemma or French translation for a token is the same as the x<sup>th</sup> token before, it is annotated with "IDEM-x".</p>
<p>The csv file "liste_textes_distribues" contains details for each text: author, title of the book, publishing year, code, genre and linguistic area.</p>
https://doi.org/10.5281/zenodo.1484393
oai:zenodo.org:1484393
pcd
Zenodo
http://hal.archives-ouvertes.fr/hal-01704806
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1172575
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Linguistics
Natural Language Processing
Picard
Part-of-speech
Lemma
Annotated Corpus for Picard
info:eu-repo/semantics/other
oai:zenodo.org:1404873
2023-02-15T08:01:44Z
user-restaure
openaire_data
Bernhard, Delphine
2018-08-28
<p>This dataset contains a lexicon of place names in the Alsatian dialects. These place names were collected from several resources and manually categorised according to location types defined in the QUAERO project:</p>
<ul>
<li>loc.fac – Facility</li>
<li>loc.phys.astro – Astronym</li>
<li>loc.phys.geo – Geonym</li>
<li>loc.phys.hydro – Hydronym</li>
<li>loc.adm.nat – Country</li>
<li>loc.adm.reg – Region</li>
<li>loc.adm.sup – Supranational</li>
<li>loc.adm.town – City</li>
<li>loc.oro – Odonym</li>
</ul>
<p>The CSV file contains 4 columns:<br>
1. Place name in Alsatian<br>
2. Place name in French<br>
3. Quaero category<br>
4. Source(s): WikiAls (articles from the Alemannic Wikipedia), WikiFr (articles from the French Wikipedia), corpus (Wikipedia articles from the Alemannic Wikipedia and chronicles from an information magazine published by the Haut-Rhin department (southern Alsace) General Council). This field also indicates whether the same spelling can be found in other lexicons of place names for Alsatian: AlsaDico (Edmond Jung. <em>L’alsadico : 22 000 mots et expressions français-alsacien</em>. La Nuée bleue, Strasbourg, 2006.) and Elsàsser (Marc Hug. <em>Toponymes d’Alsace</em>. Online, <a href="http://elsasser.free.fr/NomCommu/ecrantot.html">http://elsasser.free.fr/NomCommu/ecrantot.html</a>, 2007.)</p>
<p>The dataset was produced in the context of the RESTAURE project, funded by the French ANR. The lexicon is also described in the following article: <a href="https://hal.archives-ouvertes.fr/hal-01702656">https://hal.archives-ouvertes.fr/hal-01702656</a>.</p>
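<p>The category codes and the four-column layout described above lend themselves to a simple lookup and count. The Python sketch below takes the code-to-label mapping from the list in this record. The function names and the two sample rows are assumptions made for the example and are not taken from the lexicon file.</p>

```python
from collections import Counter

# Mapping copied from the category list in the record.
QUAERO_LOCATION_LABELS = {
    "loc.fac": "Facility",
    "loc.phys.astro": "Astronym",
    "loc.phys.geo": "Geonym",
    "loc.phys.hydro": "Hydronym",
    "loc.adm.nat": "Country",
    "loc.adm.reg": "Region",
    "loc.adm.sup": "Supranational",
    "loc.adm.town": "City",
    "loc.oro": "Odonym",
}


def category_label(code):
    """Return the readable label of a Quaero location code."""
    return QUAERO_LOCATION_LABELS[code]


def count_by_category(rows):
    """Count rows per label; each row is (Alsatian, French, category, sources)."""
    return Counter(category_label(row[2]) for row in rows)


# Invented sample rows, for illustration only.
rows = [
    ("Strossburi", "Strasbourg", "loc.adm.town", "WikiAls"),
    ("Elsass", "Alsace", "loc.adm.reg", "WikiAls"),
]
print(count_by_category(rows))
```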
https://doi.org/10.5281/zenodo.1404873
oai:zenodo.org:1404873
Zenodo
https://hal.archives-ouvertes.fr/hal-01702656
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1404872
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Linguistics
Alsatian
Place names
Named Entities
Lexicon of Place Names in the Alsatian Dialects
info:eu-repo/semantics/other
oai:zenodo.org:1182933
2020-01-20T14:27:15Z
user-restaure
Bras, Myriam
Esher, Louise
Sibille, Jean
Vergez-Couret, Marianne
2018-02-22
<p>This is the corpus description of a set of data containing a collection of texts in several dialects of Occitan (lengadocian, gascon, provençau, vivaro-aupenc, auvernhàs, lemosin) manually annotated with parts-of-speech and lemmas available in :</p>
https://doi.org/10.5281/zenodo.1182933
oai:zenodo.org:1182933
oci
Zenodo
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1182932
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
corpus description
POS
occitan
Annotated Corpus for Occitan : Corpus Description
info:eu-repo/semantics/technicalDocumentation
oai:zenodo.org:1404897
2023-02-15T08:00:23Z
user-restaure
software
Bernhard, Delphine
2018-08-28
<p>A Python module to tokenise texts in the Alsatian dialects. See the module header for help on how to use the tokeniser.</p>
<p>The module requires Python 2.7.</p>
<p>This tool was developed in the context of the RESTAURE project, funded by the French ANR. The tokeniser is also described in the following article: https://hal.archives-ouvertes.fr/hal-01539160.</p>
https://doi.org/10.5281/zenodo.1404897
oai:zenodo.org:1404897
Zenodo
https://hal.archives-ouvertes.fr/hal-01539160
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1404896
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Natural Language Processing
Alsatian
Tokenisation
Tokeniser for the Alsatian Dialects
info:eu-repo/semantics/other
oai:zenodo.org:1174219
2023-02-15T08:02:11Z
user-restaure
Steiblé, Lucie
Bernhard, Delphine
2018-02-21
<p>This report describes the transcription process and guidelines for the following dataset: 10.5281/zenodo.1174214<br>
The report is written in French.</p>
https://doi.org/10.5281/zenodo.1174219
oai:zenodo.org:1174219
fra
Zenodo
https://doi.org/10.5281/zenodo.1174214
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1174218
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Phonetic transcription
Linguistics
Alsatian Dialects
Phonetic Transcription for the Alsatian Dialects
info:eu-repo/semantics/technicalDocumentation
oai:zenodo.org:10132307
2023-11-15T09:44:08Z
user-restaure
openaire_data
Dorffer, Clément
Bernhard, Delphine
Erhart, Pascale
Huck, Dominique
Steiblé, Lucie
2023-11-15
<p>This corpus contains a collection of texts in the Alsatian dialects which were manually annotated with parts-of-speech, lemmas, translations into French and location entities.</p>
<p>The corpus was produced in the context of the RESTAURE project, funded by the French ANR. The current version of the corpus contains 21 documents and 12,907 syntactic words. The annotation process is detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704806">http://hal.archives-ouvertes.fr/hal-01704806</a></p>
<p><strong>Information about version 3</strong></p>
<p>Version 3 corrects some minor errors in the CoNLL-U files: wrong token indexes after multiword tokens and missing _ in glosses. In addition, all files are concatenated into a single CoNLL-U file.</p>
<p><strong>Information about version 2</strong></p>
<p>Version 2 contains the same annotated documents as version 1, but some errors have been corrected and the annotated corpus is provided in the <a href="http://universaldependencies.org/format.html">CoNLL-U format</a>.</p>
<p>The untokenised and unannotated versions of the documents are found in the "txt" folder. The annotated versions of the documents are found in the "ud" folder (<a href="http://universaldependencies.org/format.html">CoNLL-U format</a>).</p>
<p>In addition to the form, the lemma and the part-of-speech, additional information is also provided:</p>
<ul>
<li>translation of the lemma into French (Gloss field)</li>
<li>annotation of location names (NamedType field)</li>
</ul>
https://doi.org/10.5281/zenodo.10132307
oai:zenodo.org:10132307
Zenodo
http://hal.archives-ouvertes.fr/hal-01704806
https://doi.org/10.5281/zenodo.1171925
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1170128
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Linguistics
Natural Language Processing
Alsatian
Part-of-speech
Lemma
Annotated Corpus for the Alsatian Dialects
info:eu-repo/semantics/other
oai:zenodo.org:1173428
2020-01-20T14:22:58Z
user-restaure
Martin, Fanny
Rey, Christophe
Reynés, Philippe
2018-05-07
<p>These guidelines were produced in the context of the RESTAURE project, funded by the French ANR. They were used to annotate the following corpus: <a href="https://doi.org/10.5281/zenodo.1172575">10.5281/zenodo.1172575</a></p>
<p>They detail the tags along with examples, and provide answers to some specific linguistic issues.</p>
<p>The guidelines are written in French.</p>
https://doi.org/10.5281/zenodo.1173428
oai:zenodo.org:1173428
Zenodo
https://doi.org/10.5281/zenodo.1172575
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1173427
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Corpus
Annotation
Picard
Part-of-speech
Lemma
Linguistics
Part-of-Speech Annotation Guidelines for Picard
info:eu-repo/semantics/technicalDocumentation
oai:zenodo.org:1182949
2020-01-24T19:26:16Z
user-restaure
openaire_data
Bras, Myriam
Esher, Louise
Sibille, Jean
Vergez-Couret, Marianne
2018-06-22
<p>This corpus contains a collection of texts in Occitan which were manually annotated with parts of speech and lemmas.</p>
<p>The corpus was produced in the context of the RESTAURE project, funded by the French ANR. The current version of the corpus contains 28 documents and 12,425 tokens. The annotation process is detailed in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01704806">http://hal.archives-ouvertes.fr/hal-01704806</a></p>
<p>The annotated versions are provided in a tab-separated (TSV) CoNLL-U format.</p>
https://doi.org/10.5281/zenodo.1182949
oai:zenodo.org:1182949
oci
Zenodo
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1182948
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Occitan
Corpus
Linguistics
Part Of Speech
Natural Language Processing
Lemma
Annotated Corpus for Occitan
info:eu-repo/semantics/other
oai:zenodo.org:1404914
2023-02-15T08:01:23Z
user-restaure
openaire_data
Bernhard, Delphine
2018-08-28
<p>This dataset provides trained Tesseract (<a href="https://github.com/tesseract-ocr/tesseract">https://github.com/tesseract-ocr/tesseract</a>) OCR models for the Alsatian dialects. These models were developed in the context of the RESTAURE project, funded by the French ANR. </p>
<p>Two models are provided:</p>
<p>The first model, ISKO_2015, was presented in the following article: <a href="http://hal.archives-ouvertes.fr/hal-01252241">https://hal.archives-ouvertes.fr/hal-01252241</a>. It was trained with the jTessBoxEditor tool (<a href="http://vietocr.sourceforge.net/training.html">http://vietocr.sourceforge.net/training.html</a>), version 1.4 (2 May 2015), on images automatically generated from the training texts (excerpts from 7 different printed works, totalling about 9,000 words). The images were generated with a 36pt font size in two fonts (Arial and Times New Roman), each in its normal and italic variants.<br>
The Tesseract model (gsw.traineddata) can be used with Tesseract 3.0x.</p>
<p>The second model, 2018, was trained for Tesseract 4.0x, using jTessBoxEditor version 2.0.1 (28 July 2018). Again, images were automatically generated from the training text. The training text differs from the one used for the ISKO_2015 model and is "artificial", in the sense that it was built by appending word n-grams extracted from a large variety of published texts in Alsatian, spanning two centuries and different text genres. The images corresponding to this training text were generated with the Tesseract text2image tool, using the following parameters: --ptsize=36 --leading=20. The fonts used are listed in the gsw.font_properties file.</p>
<p>Dictionary data has also been used for training. We conflated Alsatian words found in several lexicons and corpora:</p>
<ul>
<li>Lexicons produced by the OLCA (Office pour la Langue et les Cultures d'Alsace et de Moselle): <a href="http://www.olcalsace.org/fr/lexiques">http://www.olcalsace.org/fr/lexiques</a></li>
<li>Lexicon from a Wiktionary user page: <a href="http://fr.wiktionary.org/wiki/Utilisateur:Laurent_Bouvier/alsacien-fran%C3%A7ais">https://fr.wiktionary.org/wiki/Utilisateur:Laurent_Bouvier/alsacien-fran%C3%A7ais</a></li>
<li>Lexicon from the ACPA association: <a href="http://web.archive.org/web/20160302234127/http:/culture.alsace.pagesperso-orange.fr/dictionnaire_alsacien.htm">http://web.archive.org/web/20160302234127/http:/culture.alsace.pagesperso-orange.fr/dictionnaire_alsacien.htm</a></li>
<li>Chronicles published by Raymond Matzen in the local newspaper "Les Dernières Nouvelles d'Alsace"</li>
<li>Transcriptions of television shows found in Erhart, P. (2012). <em>Les dialectes dans les médias: quelle image de l’Alsace véhiculent-ils dans les émissions de la télévision régionale?</em>, Université de Strasbourg, <a href="http://www.theses.fr/167563386">http://www.theses.fr/167563386</a></li>
<li>French-Alsatian parallel corpus provided by the OLCA</li>
<li>Excerpts from Adolf, P. (2006). <em>Dictionnaire comparatif multilingue: français-allemand-alsacien-anglais</em>, Strasbourg, France, Midgard, 373 p.</li>
</ul>
<p>The Tesseract models can be used, for instance, with the gImageReader tool (<a href="https://github.com/manisandro/gImageReader">https://github.com/manisandro/gImageReader</a>), which provides a graphical user interface for Tesseract.</p>
<p>When evaluated against the same test corpus (prose by Marie Hart, theater and poetry by Gustave Stoskopf, and prose by Charles Zumstein, totalling about 4,900 words), both models achieve roughly the same performance. Performance can usually be improved further by combining the Alsatian-specific model with the French and German models for Tesseract (available from <a href="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</a>).</p>
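<p>Combining models as described above can be done from the command line by joining language codes with "+" in Tesseract's -l option. The Python sketch below only builds such a command. It assumes that gsw.traineddata, fra.traineddata, and deu.traineddata are installed in the tessdata directory; the file names page.png and page, and the order of the languages, are hypothetical choices made for illustration.</p>

```python
import shlex


def tesseract_command(image_path, output_base, languages=("gsw", "fra", "deu")):
    """Build the argument list for a Tesseract run that combines several models.

    Tesseract accepts multiple languages joined with '+' after the -l option.
    """
    if not languages:
        raise ValueError("at least one language model is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical input image and output base name.
cmd = tesseract_command("page.png", "page")
print(shlex.join(cmd))  # prints: tesseract page.png page -l gsw+fra+deu
```

<p>The resulting list can be passed to <code>subprocess.run(cmd, check=True)</code> on a machine where Tesseract and the three models are installed.</p>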
https://doi.org/10.5281/zenodo.1404914
oai:zenodo.org:1404914
Zenodo
https://hal.archives-ouvertes.fr/hal-01252241
https://zenodo.org/communities/restaure
https://doi.org/10.5281/zenodo.1404913
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 4.0 International
https://creativecommons.org/licenses/by-sa/4.0/legalcode
OCR
Tesseract
Alsatian
Tesseract OCR models for the Alsatian dialects
info:eu-repo/semantics/other