The MERIT dataset: Modelling and efficiently rendering interpretable transcripts - Part 1 of 3

de Rodrigo, Ignacio; Boal Martín-Larrauri, Jaime; López-López, Álvaro Jesús; Sánchez-Cuadrado, Alberto

doi:10.5281/zenodo.18392672

Published January 27, 2026 | Version v1

Dataset Open

1. Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University

[Partition 1/3: original samples in English and Spanish] This dataset accompanies the MERIT Dataset paper, which introduces the MERIT Dataset, a multimodal, fully labeled dataset of school grade reports. Comprising over 400 labels and 33k samples, the MERIT Dataset is a resource for training models in demanding Visually-rich Document Understanding tasks. It contains multimodal features that link patterns in the textual, visual, and layout domains. The MERIT Dataset also includes biases in a controlled way, making it a valuable tool for benchmarking biases induced in language models. The paper outlines the dataset’s generation pipeline and highlights its main features and patterns across these domains. We benchmark the dataset for token classification, showing that it poses a significant challenge even for state-of-the-art models.

Files

Files (13.8 GB)

Name	MD5	Size
original_english.tar.gz	964c34e87b19e1a40b4ceec55a59fb60	6.8 GB
original_spanish.tar.gz	5ca5e94fe72e9e51f1c6e9373d347bde	7.0 GB
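
Since each archive is several gigabytes, it may be worth checking a download against the MD5 digest listed for it in this record before extracting it. The following is a minimal sketch rather than part of the MERIT tooling: the helper names `md5_of_file` and `verify_archive` are hypothetical, and it assumes the archives keep the file names shown above.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the Files listing of this record.
EXPECTED_MD5 = {
    "original_english.tar.gz": "964c34e87b19e1a40b4ceec55a59fb60",
    "original_spanish.tar.gz": "5ca5e94fe72e9e51f1c6e9373d347bde",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read 1 MiB at a time so multi-gigabyte archives never sit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Return True if the file's MD5 matches the digest listed for its name.

    Hypothetical helper: raises KeyError if the file name is not one of the
    archives listed in this record.
    """
    path = Path(path)
    expected = EXPECTED_MD5[path.name]
    return md5_of_file(path) == expected
```

Calling `verify_archive("original_english.tar.gz")` on a completed download should return `True`; a `False` result suggests a truncated or corrupted transfer, in which case the file should be downloaded again.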

Additional details

Is supplement to: Publication: 10.1016/j.patcog.2025.112502 (DOI)

Repository URL: https://github.com/nachoDRT/MERIT-Dataset
Programming language: Python
