Published January 19, 2024 | Version v2

CEX Project - Dataset and Gold Standard

  • 1. University of Bologna

Description

These files are a new version of Cioffi, A. (2022) Data for Testing and Evaluating References Extraction and Parsing Tools (1.0) https://doi.org/10.5281/zenodo.6182066. The 56 files from Cioffi (2022) were corrected, and an additional 56 PDFs were manually annotated and aligned to the project guidelines.
The TEI files for the whole dataset (112 documents) are in the GoldStandard_TEI zip folder, and the PDFs are in the GoldStandard_PDF folder.
GoldStandard.txt contains the list of the bibliographic references of each article. An entry marked "Restricted" means that the corresponding PDF is not Open Access, so it is not shared in this publication and cannot be found in the GoldStandard_PDF folder. All other papers are Open Access and are included in that folder.
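As a rough illustration of how the TEI files might be inspected, the sketch below counts the reference entries in one TEI document. It assumes that references are encoded as `biblStruct` or `bibl` elements inside a `listBibl` element in the standard TEI namespace; this record does not state the exact encoding, so the element names, the `count_references` helper, and the inline `SAMPLE` document are all hypothetical and should be checked against the project guidelines.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; assumed (not confirmed by this record) to be used by the files.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_references(tei_xml: str) -> int:
    """Count <biblStruct> and <bibl> elements that are direct children of any <listBibl>.

    The element names are an assumption about how the gold standard encodes references.
    """
    root = ET.fromstring(tei_xml)
    reference_tags = (f"{{{TEI_NS}}}biblStruct", f"{{{TEI_NS}}}bibl")
    total = 0
    for list_bibl in root.iter(f"{{{TEI_NS}}}listBibl"):
        for child in list_bibl:
            if child.tag in reference_tags:
                total += 1
    return total


# Hypothetical miniature TEI document, used only to exercise the function.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0"><monogr><title>First</title></monogr></biblStruct>
    <biblStruct xml:id="b1"><monogr><title>Second</title></monogr></biblStruct>
  </listBibl></div></back></text>
</TEI>"""

if __name__ == "__main__":
    print(count_references(SAMPLE))
```

For a real file from the GoldStandard_TEI folder, the file content would be read into a string and passed to `count_references` in the same way.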


The code can be found at: Pagnotta, O. (2024). olgagolgan/CEX-Project: CEX Project Code (software). Zenodo. https://doi.org/10.5281/zenodo.10638757.

The output dataset of Anystyle, GROBID and OUTCITE can be found here: Pagnotta, O. (2024). CEX Project - Output Dataset (Anystyle, GROBID, OUTCITE) (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10524898.

The annotated training dataset for GROBID can be found here: Pagnotta, O. (2024). CEX Project - GROBID annotation aligned Gold Standard (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10529646.

The trained GROBID citation models can be found here: Pagnotta, O. (2024). CEX Project - trained GROBID citation models. Zenodo. https://doi.org/10.5281/zenodo.10529709.

Some results can be found here: Pagnotta, O. (2023). Investigating the performance of GROBID and OUTCITE (Version v1). Zenodo. https://doi.org/10.5281/zenodo.10036455.

The final service can be found here: Pagnotta, O. and Paolini, L. (2024). opencitations/cec: alpha version (service). Zenodo. https://doi.org/10.5281/zenodo.10635630.


This work is part of my thesis research for the Digital Humanities and Digital Knowledge Master's Course at the University of Bologna.

Files

GoldStandard.txt

Files (343.8 MB)

Size       Checksum
32.7 kB    md5:545a5f3120352fb7184fd94ef41c7763
343.2 MB   md5:80e92cd14c7128088656c422b2224eca
635.9 kB   md5:14be1fc2018145ff70d7181ba6c58988
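The MD5 checksums in the listing can be used to confirm that a download completed without corruption. The following is a minimal sketch that uses Python's standard `hashlib` module. The helper names `md5_of_file` and `is_listed` are hypothetical. Because the listing does not say which checksum belongs to which file name, the sketch only checks whether a file's digest matches any of the three listed values.

```python
import hashlib

# Checksums copied from the file listing above. The listing does not pair them
# with file names, so they are kept as an unordered set.
EXPECTED_MD5 = {
    "545a5f3120352fb7184fd94ef41c7763",
    "80e92cd14c7128088656c422b2224eca",
    "14be1fc2018145ff70d7181ba6c58988",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path):
    """Return True if the file's MD5 digest matches one of the listed checksums."""
    return md5_of_file(path) in EXPECTED_MD5
```

Calling `is_listed` with the local path of a downloaded file returns `True` only when its digest equals one of the three listed values; reading in chunks keeps memory use low for the 343.2 MB archive.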

Additional details