Multilingual Segmentation Dataset for Historical Prose (13th–16th c.)
Description
This dataset was developed to train a multilingual sentence segmentation model, used as a pre-processing step in the automatic alignment of historical texts with Aquilign, a multilingual alignment tool developed by our team.
The corpus provides training material for sentence-level segmentation in historical prose from the 13th to 16th centuries. Texts were selected for their genre diversity (narrative, didactic, legal, theological, scholarly prose) and for their ability to reflect editorial, orthographic, and linguistic variation across time, geography, and scribal practices.
The current version of the corpus (v1) includes approximately 50,000 segmented excerpts across seven historical languages (Latin, French, Castilian, Catalan, Portuguese, Italian, and English).
Segment boundaries are marked with the pound sign (£); each segment typically corresponds to a sentence or syntactic unit. The corpus contains sentence-level segmentation only and does not include part-of-speech tagging or syntactic annotation.
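As a rough illustration of how the £ delimiter can be consumed, the sketch below splits a delimited string into segments and derives per-token boundary labels of the kind a segmentation model might be trained on. It assumes plain-text input tokenized on whitespace; the function names `split_segments` and `boundary_labels` are hypothetical, the sample string is illustrative and not taken from the corpus, and the actual file layout inside the archive may differ.

```python
DELIMITER = "£"  # segment boundary marker described above


def split_segments(text: str) -> list[str]:
    """Split text on the delimiter, trimming whitespace and dropping empty pieces."""
    return [piece.strip() for piece in text.split(DELIMITER) if piece.strip()]


def boundary_labels(text: str) -> tuple[list[str], list[int]]:
    """Return whitespace tokens and one label per token (1 = token opens a segment).

    Assumption: whitespace tokenization is adequate for illustration; a real
    pipeline would use the tokenizer of the model being trained.
    """
    tokens: list[str] = []
    labels: list[int] = []
    for segment in split_segments(text):
        words = segment.split()
        tokens.extend(words)
        labels.extend([1] + [0] * (len(words) - 1))
    return tokens, labels


if __name__ == "__main__":
    # Illustrative string only, not an excerpt from the dataset.
    sample = "In principio erat verbum £ et verbum erat apud Deum"
    print(split_segments(sample))
    print(boundary_labels(sample))
```

Because empty pieces are dropped, the same functions work whether the delimiter appears between segments, before each segment, or after each segment.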
Files
| Name | Size | Checksum |
|---|---|---|
| segmented_data.zip | 631.4 kB | md5:27141da9586aee02c115eb2d4184307d |
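The listed MD5 checksum can be used to confirm that a downloaded copy of the archive is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_archive` are hypothetical, and the local path is assumed to point at wherever the archive was saved.

```python
import hashlib

# Checksum as listed for segmented_data.zip in this record.
EXPECTED_MD5 = "27141da9586aee02c115eb2d4184307d"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive("segmented_data.zip")` should return `True` for an uncorrupted download saved under that name in the working directory.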
Additional details
Additional titles
- Subtitle (En)
- From manuscripts to models: a multilingual corpus for sentence segmentation in historical prose
Dates
- Submitted
- 2025-08-29
Software
- Repository URL
- https://github.com/ProMeText/multilingual-segmentation-dataset
- Development Status
- Active