Data for Testing and Evaluating References Extraction and Parsing Tools
Description
This record contains the data used to test and evaluate tools that extract references from papers in PDF format. It derives from my thesis on reference extraction and parsing tools, which evaluates seven tools: Anystyle, Cermine, ExCite, Grobid, Pdfssa4met, Scholarcy and Science Parse.
The folder PDF_papers contains the 56 PDF papers used as the input dataset. Each file name consists of an abbreviation of the research field the paper belongs to, followed by a number that orders the papers from 1 to 54. For the two z_notes files, the numbering restarts from 0. These last two files are a special case because they do not contain an explicitly named references section, so they serve as a further test for the tools. The papers were selected from the work “An Analysis of Citing and Referencing Habits across All Scholarly Disciplines: Approaches and Trends in Bibliographic Metadata Errors.”
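The naming scheme described above can be used to group the input papers by research field. The following is a minimal sketch, not part of the published dataset code. It assumes that each file name is a field abbreviation, optionally followed by `_` or `-`, then the number and the `.pdf` extension; the exact separator is not stated in this description, and the names `split_name` and `group_by_field` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: "<field abbreviation><optional _ or -><number>.pdf".
# The real separator in PDF_papers may differ; adjust the pattern if so.
NAME_RE = re.compile(r"^(?P<field>.+?)[_-]?(?P<num>\d+)\.pdf$", re.IGNORECASE)


def split_name(filename):
    """Return (field, number) for a matching file name, else None."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return match.group("field"), int(match.group("num"))


def group_by_field(filenames):
    """Map each field abbreviation to the sorted list of its paper numbers."""
    groups = defaultdict(list)
    for name in filenames:
        parsed = split_name(name)
        if parsed is not None:
            field, number = parsed
            groups[field].append(number)
    return {field: sorted(numbers) for field, numbers in groups.items()}
```

Calling `group_by_field` on the names returned by `os.listdir("PDF_papers")` would give one entry per research field, with the z_notes files forming their own group.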
The folder output_files contains seven sub-folders, one for each tool selected in the thesis, each containing the parsed references for the files of the input dataset. Not all the folders contain 56 files, because some tools were not able to return an output for every input PDF paper.
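To see which input papers are missing from each tool's folder, the two directories can be compared. This is a rough sketch under one important assumption: each output file shares its base name (the part before the last extension) with the corresponding input PDF. The description does not confirm this, so the matching rule may need adapting. The function name `coverage` is hypothetical.

```python
from pathlib import Path


def coverage(pdf_dir, output_dir):
    """Return {tool folder name: sorted input stems with no output file}.

    Assumption: an output file is matched to its input PDF by identical
    stem, e.g. "X_1.pdf" <-> "X_1.xml". Adjust if the tools name their
    outputs differently.
    """
    expected = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    report = {}
    for tool_dir in sorted(Path(output_dir).iterdir()):
        if not tool_dir.is_dir():
            continue
        produced = {p.stem for p in tool_dir.iterdir() if p.is_file()}
        report[tool_dir.name] = sorted(expected - produced)
    return report
```

For example, `coverage("PDF_papers", "output_files")` would return one entry per tool folder, with an empty list for any tool that produced an output for all 56 papers.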
The folder goldStand_parsed contains two sub-folders: gold_standard_files, which holds the gold standard files of the input dataset, and parsed_output_files, which holds the parsed references converted to TEI XML. The conversion was performed with the code published at https://doi.org/10.5281/zenodo.6182128.
Files
| Name | Size | Checksum |
|---|---|---|
| references_extraction_DATA.zip | 138.7 MB | md5:8f2c300390606274d6c00026f1376ec7 |
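After downloading the archive, its integrity can be checked against the md5 checksum listed for it. The sketch below uses only Python's standard `hashlib` module; the local path passed to `md5_of_file` is an assumption about where the archive was saved.

```python
import hashlib

# md5 value listed for references_extraction_DATA.zip in this record.
EXPECTED_MD5 = "8f2c300390606274d6c00026f1376ec7"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

A call such as `is_intact("references_extraction_DATA.zip")` returns `False` if the download was truncated or corrupted.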
Additional details
References
- Santos, Erika Alves dos, Silvio Peroni, and Marcos Luiz Mucheroni. "An Analysis of Citing and Referencing Habits across All Scholarly Disciplines: Approaches and Trends in Bibliographic Metadata Errors." arXiv.org, February 17, 2022. https://doi.org/10.48550/arXiv.2202.08469