Published April 22, 2021 | Version 1.0.0
Dataset Open

Data for Training and Evaluating Metadata Extraction Models based on 15 Thousand Cyrillic Script Publications

  • 1. Institute AIFB, Karlsruhe Institute of Technology (KIT)

Description

Data for training and evaluating sequence labeling models for metadata extraction, based on 15,553 Cyrillic-script papers spanning 27 years and three languages.

For each paper, ground truth sequence labeling output is provided in TEI format and as annotated plain text.
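
As a rough illustration of how TEI ground truth can be turned into (label, text) pairs for sequence labeling, here is a minimal Python sketch. The exact TEI layout of the files in this data set is not described on this page, so the sample string and its element names (`docTitle`, `titlePart`, `docAuthor`) are hypothetical; the function itself only assumes well-formed XML and treats each leaf element's tag as the label.

```python
import xml.etree.ElementTree as ET


def labeled_segments(tei_xml):
    """Return (label, text) pairs for every leaf element that carries text."""
    root = ET.fromstring(tei_xml)
    pairs = []
    for elem in root.iter():
        # A leaf element has no children; skip leaves that are empty or whitespace-only.
        if len(elem) == 0 and elem.text and elem.text.strip():
            label = elem.tag.rsplit("}", 1)[-1]  # drop an XML namespace prefix, if any
            pairs.append((label, " ".join(elem.text.split())))
    return pairs


# Hypothetical sample, NOT taken from the data set.
sample = """<tei><text><front>
<docTitle><titlePart>Пример заголовка</titlePart></docTitle>
<byline><docAuthor>И. Иванов</docAuthor></byline>
</front></text></tei>"""

for label, text in labeled_segments(sample):
    print(label, "->", text)
```

Text that sits between child elements (mixed content) is ignored by this sketch; a real loader for the data set would need to follow whatever conventions the accompanying code on GitHub uses.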

 

The code used for creating and evaluating the data set can be found on GitHub.

To cite the data set, please refer to our paper introducing it:

@inproceedings{kssf-2021-cyrillic,
    title = {{Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic}},
    author = {Krause, Johan and
              Shapiro, Igor and
              Saier, Tarek and
              F{\"a}rber, Michael},
    booktitle = {Proceedings of the Second Workshop on Scholarly Document Processing},
    year = {2021}
}

 

Files (17.8 MB)

Checksum                              Size
md5:b11a8f6853eadbac8218f7bcc3da8bca  17.7 MB
md5:9dd6529fc4138a7799bb7c0aac21d9ab  20.0 kB
md5:4b1cfcc2dd7f741aa831233fc6ce19b4  648 Bytes
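
The md5 values listed above can be used to check that a download arrived intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `matches_listing` are made up for this example, and the file name in the usage comment is a placeholder, since the file names are not shown in this listing.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:b11a8f68...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected


# Usage (placeholder file name, replace with the file you downloaded):
# matches_listing("downloaded_file", "md5:b11a8f6853eadbac8218f7bcc3da8bca")
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 17.7 MB archive but makes the helper reusable for larger downloads.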