Published February 19, 2022 | Version 1.0
Other Open

Data for Testing and Evaluating References Extraction and Parsing Tools

Authors/Creators

  • 1. University of Bologna

Description

This work contains the data used to test and evaluate tools for extracting references from papers in PDF format. It derives from my thesis on reference extraction and parsing tools, in which the selected tools are: Anystyle, Cermine, ExCite, Grobid, Pdfssa4met, Scholarcy and Science Parse.

The folder PDF_papers contains the 56 PDF papers used as the input dataset. Each file name is composed of the abridged form of the research field the paper belongs to plus a numeric value that orders the files from 1 to 54. For the z_notes files, the numbering restarts from 0. These last two files are special because they do not contain an explicitly named references section, so they represent a further test for the tools. The papers were selected from the work “An Analysis of Citing and Referencing Habits across All Scholarly Disciplines: Approaches and Trends in Bibliographic Metadata Errors.”


The folder output_files contains seven sub-folders, one for each tool selected in the thesis, each containing the parsed references for each file of the input dataset. Not all the folders contain 56 files, since some tools were not able to return an output for every input PDF paper.
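To see which tools returned fewer outputs than there are input papers, the sub-folders can be counted after extracting the archive. The sketch below is only an illustration: the function names are hypothetical, the path to the extracted output_files folder is supplied by the reader, and it assumes each tool writes one output file per paper, which the description does not state explicitly.

```python
from pathlib import Path

EXPECTED_PAPERS = 56  # number of input PDFs stated in the description


def count_outputs(output_root):
    """Return {sub_folder_name: number_of_files} for each tool sub-folder."""
    root = Path(output_root)
    counts = {}
    for tool_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count regular files at any depth below the tool folder.
        counts[tool_dir.name] = sum(1 for f in tool_dir.rglob("*") if f.is_file())
    return counts


def incomplete_tools(counts, expected=EXPECTED_PAPERS):
    """Return the tools whose file count is below the expected number.

    Assumption: one output file per input paper.
    """
    return {tool: n for tool, n in counts.items() if n < expected}
```

Calling `incomplete_tools(count_outputs("output_files"))` on the extracted folder would then list the tools with missing outputs, provided the one-file-per-paper assumption holds.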


The folder goldStand_parsed contains two folders: gold_standard_files, which holds the gold standard files of the input dataset, and parsed_output_files, which holds the parsed references converted to TEI XML. The conversion was made with the code published at https://doi.org/10.5281/zenodo.6182128.

Files

references_extraction_DATA.zip

Files (138.7 MB)

Name: references_extraction_DATA.zip
md5:8f2c300390606274d6c00026f1376ec7
Size: 138.7 MB
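The md5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and the local path to references_extraction_DATA.zip is whatever location the reader saved it to.

```python
import hashlib

# Checksum listed on this record for references_extraction_DATA.zip.
EXPECTED_MD5 = "8f2c300390606274d6c00026f1376ec7"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use low for an archive of this size. A result of `False` from `matches_expected("references_extraction_DATA.zip")` would suggest an incomplete or corrupted download.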

Additional details

References

  • Santos, Erika Alves dos, Silvio Peroni, and Marcos Luiz Mucheroni. "An Analysis of Citing and Referencing Habits across All Scholarly Disciplines: Approaches and Trends in Bibliographic Metadata Errors." arXiv.org, February 17, 2022. https://doi.org/10.48550/arXiv.2202.08469