The Machine-Actionable Ancient Text (MAAT) Corpus

Fitzgerald, William; Barney, Justin

doi:10.5281/zenodo.12553283

Published August 15, 2024 | Version 1.0.0-beta

Dataset Open

The Machine-Actionable Ancient Text (MAAT) Corpus is a new resource providing training and evaluation data for restoring lacunae in ancient Greek, Latin, and Coptic texts. Current text restoration systems require large amounts of training data and task-relevant means of evaluation. The MAAT Corpus addresses this need by converting texts available in EpiDoc XML format into a machine-actionable format that preserves the textually salient features needed for machine learning: the text itself, unclear letters, restorations, and lacunae. Structured test cases generated from the corpus align with the actual restoration task performed by papyrologists and epigraphers, enabling more realistic evaluation than the synthetic tasks used previously. The initial 1.0 beta release contains approximately 134,000 text editions, 178,000 text blocks, and 750,000 individual restorations, with Greek and Latin predominating. The corpus aims to facilitate the development of computational methods that assist scholars in accurately restoring ancient texts.
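
To make the kind of conversion described above concrete, the following is a minimal sketch of flattening a small EpiDoc fragment into a single string that keeps unclear letters, restorations, and lacunae visible. It is not the MAAT converter: the output notation (square brackets for restorations, one dot per lost character, a combining dot below for unclear letters) follows common Leiden-style conventions and is an assumption here, as are the function name `flatten` and the sample fragment. Only the TEI/EpiDoc elements `unclear`, `supplied`, and `gap` are taken from the EpiDoc vocabulary.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
DOT_BELOW = "\u0323"  # combining dot below; assumed marker for an unclear letter


def flatten(elem):
    """Render one EpiDoc element (and its children) as a Leiden-style string.

    Hypothetical helper for illustration; the real MAAT encoding may differ.
    """
    tag = elem.tag.replace(TEI_NS, "")
    if tag == "gap":
        # Lacuna: one dot per lost character when the extent is given,
        # otherwise a generic three-dot placeholder.
        quantity = elem.get("quantity")
        dots = "." * int(quantity) if quantity and quantity.isdigit() else "..."
        out = "[" + dots + "]"
    else:
        inner = (elem.text or "") + "".join(flatten(child) for child in elem)
        if tag == "supplied":
            out = "[" + inner + "]"  # editor's restoration
        elif tag == "unclear":
            # Mark every non-space character as damaged but legible.
            out = "".join(c if c.isspace() else c + DOT_BELOW for c in inner)
        else:
            out = inner
    # Text that follows the closing tag belongs to the parent's content.
    return out + (elem.tail or "")


# Invented Latin fragment, used only to exercise the three element types.
sample = (
    '<ab xmlns="http://www.tei-c.org/ns/1.0">'
    'imp<unclear>er</unclear><supplied reason="lost">ator</supplied> '
    '<gap reason="lost" quantity="3" unit="character"/>'
    '</ab>'
)
rendered = flatten(ET.fromstring(sample))
print(rendered)
```

A restoration test case could then be derived by hiding the bracketed text and asking a model to predict it, which mirrors the task described in the abstract more closely than masking arbitrary characters.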

Files

Files (1.7 GB)

Name	Size	MD5
maat-1.0.0+beta.json	1.7 GB	6d97669ef381f6ecd145d8393d3840d8
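
Because the file is large, it can be worth checking a download against the MD5 value given in the file listing. The sketch below uses only the Python standard library; the helper name `md5_of_file` and the local path are assumptions for illustration, and the file is read in chunks so that the 1.7 GB download is never held in memory at once.

```python
import hashlib
import os

# Checksum taken from the file listing for maat-1.0.0+beta.json.
EXPECTED_MD5 = "6d97669ef381f6ecd145d8393d3840d8"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


local_copy = "maat-1.0.0+beta.json"  # assumed download location
if os.path.exists(local_copy):
    status = "OK" if md5_of_file(local_copy) == EXPECTED_MD5 else "MISMATCH"
    print(local_copy, status)
```

A mismatch usually indicates an interrupted or corrupted download rather than a changed dataset, so re-downloading is the first thing to try.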

Additional details

Updated: 2024-06-26

First beta update

Repository URL: https://github.com/WMU-Herculaneum-Project/maat

Resource type: Dataset

Publisher: WMU Herculaneum Project

Conference: Machine Learning for Ancient Languages, ACL 2024 Workshop (ML4AL), hybrid in Bangkok, Thailand and remote, 15 August 2024

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: June 26, 2024
Modified: June 26, 2024
