Middle Dutch syllabified words

Published July 31, 2018 | Version v3

Dataset Open

Specifics of the data:

Text file (syllabified_crm.txt) containing 43,710 syllabified Middle Dutch words, taken from the Corpus Van Reenen-Mulder. This corpus, created by Pieter van Reenen and Maaike Mulder at the Free University Amsterdam, contains about 2,500 Middle Dutch charters, totalling about 750,000 tokens. The charters were written in the Netherlands and Flanders between 1300 and 1400.
The 43,710 syllabified words in this list are the unique words from the Corpus Van Reenen-Mulder. Some tokens from this corpus were excluded, however, when assembling the data set because they contained diacritic symbols that indicate abbreviations, clitics, or unclear parts in the original charter.
A dash (-) is used as the separator between syllables.
Apart from the entire data set, this DOI also includes:
- A PDF file visualizing the data set (corpus_viz.pdf).
- The splits used for the automatic syllabification experiment by Haverals, Kestemont & Karsdorp (2018) (splits.zip).
- A gold standard out-of-corpus sample of 1,748 Middle Dutch words, taken at random from the Cd-rom Middelnederlands and also used in the above-mentioned syllabification experiment (gold_syllabification_cdrom.txt).
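
The dash-separated format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the data set: it assumes one entry per line and UTF-8 encoding, neither of which is stated in this record, and the function names (split_syllables, plain_form, load_entries) are hypothetical. The sample entry "ver-co-pen" is illustrative only and is not claimed to occur in the file.

```python
from pathlib import Path


def split_syllables(entry: str) -> list[str]:
    """Split one dash-separated entry into its syllables."""
    return entry.strip().split("-")


def plain_form(entry: str) -> str:
    """Recover the unsyllabified spelling by dropping the separators."""
    return "".join(split_syllables(entry))


def load_entries(path: str) -> list[list[str]]:
    """Read a word list, ASSUMING one entry per line and UTF-8 encoding."""
    text = Path(path).read_text(encoding="utf-8")
    # Skip blank lines so that a trailing newline does not yield an empty entry.
    return [split_syllables(line) for line in text.splitlines() if line.strip()]


# Illustrative entry only; not necessarily present in syllabified_crm.txt.
print(split_syllables("ver-co-pen"))
print(plain_form("ver-co-pen"))
```

With a local copy of the file, load_entries("syllabified_crm.txt") would return one list of syllables per word, provided the layout assumptions hold.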

Files

Name	MD5	Size
corpus_viz.pdf	8b267eb907ae54290287def59a7c84a2	157.5 kB
gold_syllabification_cdrom.txt	3fb333fcae7ebcf0865d149e86694dcb	19.3 kB
splits.zip	736915614ab80350e4160a239b0f1fe1	195.3 kB
syllabified_crm.txt	045201b6ee5bdcbd0d483a3ab739e87b	511.0 kB
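
The MD5 checksums in the file listing above can be used to check that a download is intact. The sketch below uses Python's standard hashlib module; the helper names (md5_of_file, matches_listing) are hypothetical, and the digests are copied from the listing.

```python
import hashlib

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "corpus_viz.pdf": "8b267eb907ae54290287def59a7c84a2",
    "gold_syllabification_cdrom.txt": "3fb333fcae7ebcf0865d149e86694dcb",
    "splits.zip": "736915614ab80350e4160a239b0f1fe1",
    "syllabified_crm.txt": "045201b6ee5bdcbd0d483a3ab739e87b",
}


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, name: str) -> bool:
    """Compare a local file against the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, matches_listing("syllabified_crm.txt", "syllabified_crm.txt") should return True for an unmodified copy saved in the working directory. MD5 is suitable here only as a check against accidental corruption, not as a security guarantee.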

Gosse Bouma & Ben Hermans. Syllabification of Middle Dutch. In F. Mambrini, M. Passarotti, and C. Sporleder, editors, Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities, pp. 27-39, 2012.
Pieter van Reenen & Maaike Mulder. Een gegevensbank van 14de-eeuwse Middelnederlandse dialecten op computer. Lexikos 3, pp. 259-281, 1993.
Cd-rom Middelnederlands. Woordenboek en teksten. Sdu, 1998.