Published August 25, 2025 | Version 1
Dataset Open

Multilingual Segmentation Dataset for Historical Prose (13th–16th c.)

  • 1. École Nationale des Chartes
  • 2. École Normale Supérieure de Lyon

Description

This dataset was developed to train a multilingual sentence segmentation model, used as a pre-processing step in the automatic alignment of historical texts with Aquilign, a multilingual alignment tool developed by our team.

The corpus provides training material for sentence-level segmentation in historical prose from the 13th to 16th centuries. Texts were selected for their genre diversity (narrative, didactic, legal, theological, scholarly prose) and for their ability to reflect editorial, orthographic, and linguistic variation across time, geography, and scribal practices.

The current version of the corpus (v1) includes approximately 50,000 segmented excerpts across seven historical languages (Latin, French, Castilian, Catalan, Portuguese, Italian, and English).

Segment boundaries are marked with the pound sign (£); each segment typically corresponds to a sentence or syntactic unit. The corpus includes no part-of-speech tagging or syntactic annotation, only sentence-level segmentation.
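As an illustration of how the £ marker can be consumed, the sketch below splits an annotated string into segments and derives per-token boundary labels of the kind a segmentation model might be trained on. It rests on assumptions: this record does not describe the file layout inside the archive, so the code works on a plain string; the function names `split_segments` and `token_labels` are hypothetical; and the sample sentence is a made-up illustration, not an excerpt from the corpus.

```python
def split_segments(annotated: str, delimiter: str = "£") -> list[str]:
    """Split an annotated excerpt into segments at each boundary marker.

    Whitespace around the marker is stripped and empty segments are dropped,
    so both "a £ b" and "a£b" yield ["a", "b"].
    """
    parts = (part.strip() for part in annotated.split(delimiter))
    return [part for part in parts if part]


def token_labels(annotated: str, delimiter: str = "£") -> list[tuple[str, int]]:
    """Return (token, label) pairs; label is 1 when the token opens a segment.

    Tokens are obtained by whitespace splitting, which is an assumption:
    the dataset's own tokenization, if any, is not documented here.
    """
    labels = []
    for segment in split_segments(annotated, delimiter):
        for index, token in enumerate(segment.split()):
            labels.append((token, 1 if index == 0 else 0))
    return labels


# Illustrative sample only (not taken from the dataset).
sample = "In principio erat verbum £ et verbum erat apud Deum £ et Deus erat verbum"

segments = split_segments(sample)
pairs = token_labels(sample)
```

Here `segments` holds three strings, and `pairs` marks "In", the first "et", and the second "et" with label 1. Marking segment-initial tokens is one possible labeling scheme; marking segment-final tokens would work equally well.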

Files

Files (631.4 kB)

Name Size
segmented_data.zip 631.4 kB
md5:27141da9586aee02c115eb2d4184307d
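The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_archive` are hypothetical, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "27141da9586aee02c115eb2d4184307d"


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_archive("segmented_data.zip")` on a local copy returns `True` only if the file matches the published checksum. MD5 is suitable here as an integrity check against corrupted downloads, not as protection against deliberate tampering.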

Additional details

Additional titles

Subtitle (En)
From manuscripts to models: a multilingual corpus for sentence segmentation in historical prose

Dates

Submitted
2025-08-29