Dataset Open Access

Data for Training and Evaluating Metadata Extraction Models based on 15 Thousand Cyrillic Script Publications

Krause, Johan; Shapiro, Igor; Saier, Tarek; Färber, Michael

Description
Data for training and evaluating sequence labeling models for metadata extraction, based on 15,553 papers written in Cyrillic script, spanning 27 years and three languages.

For each paper, ground truth sequence labeling output is provided in TEI format and as annotated plain text.

The code used for creating and evaluating the data set can be found on GitHub.

To cite the data set, please refer to our paper introducing it:

@inproceedings{kssf-2021-cyrillic,
    title = {{Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic}},
    author = {Krause, Johan and
              Shapiro, Igor and
              Saier, Tarek and
              F{\"a}rber, Michael},
    booktitle = {Proceedings of the Second Workshop on Scholarly Document Processing},
    year = {2021}
}

Files (17.8 MB)

Name                                          Size       MD5
cyrillic_script_metadata_extraction.tar.bz2   17.7 MB    b11a8f6853eadbac8218f7bcc3da8bca
LICENSE                                       20.0 kB    9dd6529fc4138a7799bb7c0aac21d9ab
README                                        648 Bytes  4b1cfcc2dd7f741aa831233fc6ce19b4
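
Since an MD5 checksum is published for each file, a download can be checked before it is unpacked. The following is a minimal sketch, not part of the data set's own tooling: it assumes the archive has already been saved under its listed name in the current directory, and the helper names (md5_of_file, verify_and_extract) and the target directory are hypothetical. The internal layout of the archive is not described here, so the sketch only lists what was extracted.

```python
import hashlib
import tarfile
from pathlib import Path

# Checksum and file name taken from the file listing above.
EXPECTED_MD5 = "b11a8f6853eadbac8218f7bcc3da8bca"
# Assumption: the archive was downloaded into the current directory.
ARCHIVE = Path("cyrillic_script_metadata_extraction.tar.bz2")


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive, expected_md5, target_dir):
    """Check the archive against its MD5, then unpack it into target_dir.

    Returns the sorted list of extracted file paths.
    Raises ValueError if the checksum does not match.
    """
    actual = md5_of_file(archive)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    with tarfile.open(archive, mode="r:bz2") as tar:
        tar.extractall(target_dir)
    return sorted(p for p in Path(target_dir).rglob("*") if p.is_file())


# Example call (hypothetical target directory name):
# files = verify_and_extract(ARCHIVE, EXPECTED_MD5, "cyrillic_data")
# print(f"Extracted {len(files)} files")
```

MD5 is used here only because it is the checksum the listing provides; it guards against corrupted downloads, not against deliberate tampering.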