Data for Training and Evaluating Metadata Extraction Models based on 15 Thousand Cyrillic Script Publications

doi:10.5281/zenodo.4708696

Published April 22, 2021 | Version 1.0.0

Dataset Open

1. Institute AIFB, Karlsruhe Institute of Technology (KIT)

Description
Data for training and evaluating sequence labeling models for metadata extraction, based on 15,553 papers written in three Cyrillic-script languages and spanning 27 years.

For each paper, the ground-truth sequence labeling output is provided both in TEI format and as annotated plain text.

The code used for creating and evaluating the data set can be found on GitHub.

To cite this data set, please refer to our paper introducing it:

@inproceedings{kssf-2021-cyrillic,
    title = {{Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic}},
    author = {Krause, Johan and
              Shapiro, Igor and
              Saier, Tarek and
              F{\"a}rber, Michael},
    booktitle = {Proceedings of the Second Workshop on Scholarly Document Processing},
    year = {2021}
}

Files

Files (17.8 MB)

Name	MD5	Size
cyrillic_script_metadata_extraction.tar.bz2	b11a8f6853eadbac8218f7bcc3da8bca	17.7 MB
LICENSE	9dd6529fc4138a7799bb7c0aac21d9ab	20.0 kB
README	4b1cfcc2dd7f741aa831233fc6ce19b4	648 Bytes
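
The MD5 checksums listed for the files can be used to check a download before unpacking it. The following is a minimal sketch, not part of the data set itself: it assumes all three files were saved under their listed names into one local directory, and the function names (`md5_of_file`, `verify_and_extract`) and directory arguments are hypothetical. The internal layout of the archive is not described here, so the sketch only extracts it.

```python
import hashlib
import tarfile
from pathlib import Path

# Checksums as listed for the files of this record.
EXPECTED_MD5 = {
    "cyrillic_script_metadata_extraction.tar.bz2": "b11a8f6853eadbac8218f7bcc3da8bca",
    "LICENSE": "9dd6529fc4138a7799bb7c0aac21d9ab",
    "README": "4b1cfcc2dd7f741aa831233fc6ce19b4",
}

ARCHIVE_NAME = "cyrillic_script_metadata_extraction.tar.bz2"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(download_dir, target_dir):
    """Check every listed file against its MD5, then unpack the archive.

    download_dir and target_dir are hypothetical local paths chosen by the user.
    """
    download_dir = Path(download_dir)
    for name, expected in EXPECTED_MD5.items():
        actual = md5_of_file(download_dir / name)
        if actual != expected:
            raise ValueError(f"MD5 mismatch for {name}: got {actual}, expected {expected}")
    # Only unpack once all checksums match.
    with tarfile.open(download_dir / ARCHIVE_NAME, "r:bz2") as archive:
        archive.extractall(target_dir)
```

Calling `verify_and_extract("downloads", "data")` would raise a `ValueError` on the first file whose checksum differs and otherwise unpack the archive into `data`. MD5 is adequate here for detecting corrupted or incomplete downloads, though it is not a safeguard against deliberate tampering.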

