Published July 9, 2025 | Version v2
Model Open

Token files for DANIEL (Document Attention Network for Information Extraction and Labeling)

  • 1. Université de Rouen Normandie

Description

These files are required to run the DANIEL code, which is available on GitHub and described in the paper "DANIEL: a fast document attention network for information extraction and labelling of handwritten documents" by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).

The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.

The contents of this archive must be extracted into the root directory of the DANIEL codebase.
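
The extraction step can be sketched in Python as follows. This is a minimal sketch, not part of the DANIEL codebase: the function name `extract_token_files` is hypothetical, and it assumes the archive is a regular zip file and that the DANIEL repository has already been cloned locally.

```python
import zipfile
from pathlib import Path


def extract_token_files(archive_path, repo_root):
    """Extract every member of the archive into repo_root.

    Returns the list of member names found in the archive.
    Both arguments are paths chosen by the caller (assumptions, not fixed names).
    """
    repo_root = Path(repo_root)
    if not repo_root.is_dir():
        raise NotADirectoryError(f"{repo_root} is not an existing directory")
    with zipfile.ZipFile(archive_path) as archive:
        # Members keep their relative paths, so directories in the archive
        # are recreated directly under repo_root.
        archive.extractall(repo_root)
        return archive.namelist()
```

For example, `extract_token_files("subwords.zip", "path/to/DANIEL")` would unpack the archive into a local clone located at the hypothetical path `path/to/DANIEL`.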

Contents of the archive:

  • tokenizer-daniel: This directory contains the tokenizer used by the DANIEL model, saved in the format of the HuggingFace tokenizers library.

  • replace_dict.pkl: This file contains a replacement dictionary used during the teacher forcing phase of training. It is designed to randomly substitute certain subwords with similar alternatives. Each key in the dictionary corresponds to a subword index from the DANIEL vocabulary, and each associated value is a list of indices representing the candidate subwords for replacement.
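
To illustrate the structure of the replacement dictionary, the sketch below loads a pickle file and applies random substitutions to a sequence of subword indices. It is an illustration under stated assumptions, not the DANIEL training code: the function names, the `prob` parameter, and the uniform choice among candidates are all hypothetical. Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.

```python
import pickle
import random


def load_replace_dict(path):
    """Load a pickled dict mapping a subword index to a list of candidate indices."""
    with open(path, "rb") as f:
        return pickle.load(f)


def substitute_subwords(token_ids, replace_dict, prob, rng=None):
    """Return a copy of token_ids with some indices randomly replaced.

    Each index that has candidates in replace_dict is swapped, with
    probability `prob`, for one candidate drawn uniformly at random.
    Both the probability and the uniform draw are assumptions made here.
    """
    rng = rng or random.Random()
    result = []
    for token_id in token_ids:
        candidates = replace_dict.get(token_id)
        has_candidates = candidates is not None and len(candidates) > 0
        if has_candidates and rng.random() < prob:
            result.append(rng.choice(list(candidates)))
        else:
            # No candidates, or the random draw kept the original subword.
            result.append(token_id)
    return result
```

With `prob=0.0` the sequence is returned unchanged, which makes it easy to disable the perturbation at evaluation time.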

Citation Request

If you publish material based on these files, we ask that you include a reference to the paper:

« Constum, T., Tranouez, P. & Paquet, T., DANIEL: a fast document attention network for information extraction and labelling of handwritten documents. IJDAR (2025). https://doi.org/10.1007/s10032-024-00511-9 »

Files

subwords.zip

Files (18.7 MB)

MD5 checksum                              Size
md5:6a8120c32612b8905863b151c6dd6a73      17.1 kB
md5:2a2a7f1a10222f8462891ee38e05afde      18.6 MB
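
A downloaded file can be checked against its published MD5 checksum with a short script such as the one below. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the expected checksum is passed in by the caller because the listing above does not pair each checksum with a file name.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until f.read returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    `expected` may be given with or without the "md5:" prefix.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("subwords.zip", "md5:...")` returns `True` only when the local file is byte-for-byte identical to the published one, where `"md5:..."` stands for the checksum copied from the record.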

Additional details

Dates

Available
2025-07-09
