Multilingual Segmentation Dataset for Historical Prose (13th–16th c.)

Ing, Lucence; Gille Levenson, Matthias; Macedo, Carolina

doi:10.5281/zenodo.16992629

Published August 25, 2025 | Version 1

Dataset Open

1. École Nationale des Chartes
2. École Normale Supérieure de Lyon

This dataset was developed to train a multilingual sentence segmentation model, used as a pre-processing step in the automatic alignment of historical texts with Aquilign, a multilingual alignment tool developed by our team.

The corpus provides training material for sentence-level segmentation in historical prose from the 13th to 16th centuries. Texts were selected for their genre diversity (narrative, didactic, legal, theological, scholarly prose) and for their ability to reflect editorial, orthographic, and linguistic variation across time, geography, and scribal practices.

The current version of the corpus (v1) includes approximately 50,000 segmented excerpts across seven historical languages (Latin, French, Castilian, Catalan, Portuguese, Italian, and English).

Segment boundaries are marked with the pound sign (£); each segment typically corresponds to a sentence or a syntactic unit. The corpus does not include part-of-speech tagging or syntactic annotation: it provides sentence-level segmentation only.
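As an illustration of how the £ marker can be consumed, the sketch below splits an annotated excerpt into segments and derives per-token boundary labels of the kind a segmentation model might be trained on. It assumes each excerpt is available as a plain-text string. The file layout inside segmented_data.zip is not described here, so the function names (`split_segments`, `boundary_labels`) and the labelling scheme (1 for a token that opens a segment, 0 otherwise) are hypothetical choices, not the authors' pipeline.

```python
DELIMITER = "£"  # boundary marker described in the dataset description


def split_segments(excerpt: str) -> list[str]:
    """Split an annotated excerpt into segments at each £ marker.

    Empty pieces (e.g. from a leading or trailing marker) are dropped.
    """
    parts = (part.strip() for part in excerpt.split(DELIMITER))
    return [part for part in parts if part]


def boundary_labels(excerpt: str) -> list[tuple[str, int]]:
    """Pair each whitespace-separated token with a boundary label.

    Assumed scheme: 1 if the token opens a segment, 0 otherwise.
    """
    labelled = []
    for segment in split_segments(excerpt):
        for index, token in enumerate(segment.split()):
            labelled.append((token, 1 if index == 0 else 0))
    return labelled


# Illustrative string only; it is NOT taken from the corpus.
sample = "In principio erat verbum £ et verbum erat apud Deum"
segments = split_segments(sample)
labels = boundary_labels(sample)
```

Splitting on the marker and discarding empty pieces keeps the sketch indifferent to whether £ is written before or after a segment, which the description does not specify.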

Files

Name	Size
segmented_data.zip (md5:27141da9586aee02c115eb2d4184307d)	631.4 kB
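The MD5 checksum listed for segmented_data.zip can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names (`md5_of_file`, `verify_archive`) and the local path in the usage comment are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Checksum as published on the record for segmented_data.zip.
EXPECTED_MD5 = "27141da9586aee02c115eb2d4184307d"


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Check whether the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()


# Usage, assuming the archive was saved in the current directory:
#   verify_archive(Path("segmented_data.zip"))
```

MD5 is adequate here for detecting a corrupted or truncated download; it is not a safeguard against deliberate tampering.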

Additional details

Subtitle (En): From manuscripts to models: a multilingual corpus for sentence segmentation in historical prose

Submitted: 2025-08-29

Repository URL: https://github.com/ProMeText/multilingual-segmentation-dataset
Development Status: Active

Resource type: Dataset

Publisher: Zenodo

Languages: Latin, Portuguese, Catalan, Italian, Spanish, English, French

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

You are free to:
Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:
Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial: You may not use the material for commercial purposes.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Technical metadata

Created: August 29, 2025
Modified: August 29, 2025
