Published December 23, 2021 | Version 1.0
Dataset | Open

Data for Cyrillic Reference Parsing

  • Karlsruhe Institute of Technology (KIT)

Description

We provide a synthetic data set of over 100,000 labeled references (mostly in Russian) and a manually annotated set of 771 real references gathered from multidisciplinary Cyrillic script publications.

Background:

Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded consists of publications written in Cyrillic script languages. To address this problem, we introduce a new multilingual and multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and contains over 700 manually labeled references; the remainder is generated synthetically. Using random samples of varying size drawn from this data, we train multiple well-performing BERT sequence labeling models and thereby show the usability of the proposed data set. To this end, we showcase an implementation of a multilingual BERT model trained on the synthetic data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly outperforms a state-of-the-art model that we retrain and evaluate on our data.
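To make the task concrete, the following minimal sketch shows what a labeled reference string can look like and how a token-level, micro-averaged F1 score could be computed over predicted labels. The label names (AUTHOR, TITLE, YEAR, OTHER), the example reference, and the choice of token-level micro averaging are illustrative assumptions; they are not taken from the data set or from the evaluation protocol of the paper.

```python
from collections import Counter


def micro_f1(gold, pred, ignore="OTHER"):
    """Token-level micro-averaged F1 over all labels except `ignore`.

    Hypothetical helper for illustration only; the paper's exact
    evaluation procedure may differ.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    counts = Counter()
    for g, p in zip(gold, pred):
        if g == p and g != ignore:
            counts["tp"] += 1  # correctly labeled field token
        else:
            if p != ignore:
                counts["fp"] += 1  # predicted a field label that is wrong
            if g != ignore:
                counts["fn"] += 1  # missed the true field label
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up reference, split into tokens, with one label per token.
tokens = ["Иванов", "И.", "И.", "Название", "статьи", ".", "2015"]
gold = ["AUTHOR", "AUTHOR", "AUTHOR", "TITLE", "TITLE", "OTHER", "YEAR"]
# A hypothetical model output that mislabels the third token.
pred = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "TITLE", "OTHER", "YEAR"]

score = micro_f1(gold, pred)
print(f"micro F1: {score:.3f}")
```

Tokens labeled OTHER by both the annotation and the prediction are skipped, so punctuation and separators do not inflate the score in this sketch.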

The code for generating the data set is available at https://github.com/igor261/Sequence-Labeling-for-Citation-Field-Extraction-from-Cyrillic-Script-References

When using the data set, please cite the following paper:

Igor Shapiro, Tarek Saier, Michael Färber: "Sequence Labeling for Citation Field Extraction from Cyrillic Script References". In Proceedings of the AAAI-22 Workshop on Scientific Document Understanding (SDU@AAAI'22), 2022.

Files

Data_for_Cyrillic_Reference_Parsing.zip (11.1 GB)
md5:c9149fd6a58ad32fed438cb5aecce886
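Because the archive is large, it may be worth checking the download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the Python standard library; the local path of the archive and the helper names `md5_of_file` and `verify_download` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Checksum as shown in the file listing of this record.
EXPECTED_MD5 = "c9149fd6a58ad32fed438cb5aecce886"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads avoid loading the whole 11.1 GB archive into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the archive sits in the current working directory.
    archive = Path("Data_for_Cyrillic_Reference_Parsing.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.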