Published March 8, 2023 | Version 0.7.3
Dataset Open

GROBID end-to-end benchmarking datasets

Description

This record provides the datasets used for GROBID end-to-end benchmarking, covering:

- metadata extraction,

- bibliographical reference extraction, parsing and citation context identification, and

- full text body structuring.

The following collections are included:

- a PubMedCentral gold-standard dataset called PMC_sample_1943, compiled by Alexandru Constantin. The dataset, around 1.5 GB in size, contains 1943 articles from 1943 different journals, taken from a 2011 snapshot of publications. For each article, we have a PDF file and an NLM XML file.

- a bioRxiv dataset called biorxiv-10k-test-2000 of 2000 preprint articles, originally compiled with care and published by Daniel Ecer, available on Zenodo. For each article, the dataset contains a PDF file and the corresponding reference NLM file (manually created by bioRxiv). The NLM files have been further systematically reviewed by the GROBID team and annotated with additional markup for data and code availability statements and funding statements. The dataset is around 5.4 GB in size.

- a set of 1000 PLOS articles, called PLOS_1000, randomly selected from the full PLOS Open Access collection. Again, for each article, the published PDF is available with the corresponding publisher JATS XML file. The dataset is around 1.3 GB in total.

- a set of 984 articles from eLife, called eLife_984, randomly selected from their open collection available on GitHub. Every article comes with the published PDF, the publisher JATS XML file and the eLife public HTML file (as a bonus, not used), all in their latest version. The dataset is around 4.5 GB in total.

For each of these datasets, the directory structure is the same and documented here.
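Since every collection ships a PDF together with a reference XML file per article, a first sanity check after unpacking is to confirm that each PDF has its XML counterpart. The following is a minimal sketch only: the function names, the `.xml`/`.nxml` extensions and the assumption that both files of an article sit in the same directory are illustrative guesses, not the documented layout.

```python
from pathlib import Path

# Assumed extensions for the reference XML; the actual datasets may differ.
XML_SUFFIXES = {".xml", ".nxml"}


def find_reference_xml(pdf):
    """Return the XML file that most plausibly belongs to `pdf`, or None."""
    candidates = sorted(
        p for p in pdf.parent.iterdir()
        if p.is_file() and p.suffix.lower() in XML_SUFFIXES
    )
    # Prefer an XML file with the same base name as the PDF.
    same_stem = [p for p in candidates if p.stem == pdf.stem]
    if same_stem:
        return same_stem[0]
    # Otherwise accept the XML only if it is the single one in the directory.
    if len(candidates) == 1:
        return candidates[0]
    return None


def pair_pdf_with_xml(root):
    """Return (pdf_path, xml_path) pairs for every PDF under `root` that has an XML."""
    pairs = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        xml = find_reference_xml(pdf)
        if xml is not None:
            pairs.append((pdf, xml))
    return pairs
```

Calling `pair_pdf_with_xml` on an unpacked dataset directory and comparing the number of pairs with the article count given above (for example 1943 for PMC_sample_1943) gives a quick indication of whether the archive was extracted completely.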

Further information on GROBID benchmarking and how to run it is available at https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/. The latest benchmarking scores are also available in the GROBID documentation: https://grobid.readthedocs.io/en/latest/Benchmarking/

These resources were originally published under a CC-BY license. Our additional annotations are likewise released under CC-BY.

We thank NIH, bioRxiv, PLOS and eLife for making these resources Open Access and reusable.

Files

biorxiv-10k-test-2000.zip

Files (13.3 GB)

Checksum Size
md5:b6b2ac61bca174bbed949886c7970cc0 5.7 GB
md5:9b11b8a74b0783d947be35a2193e0fca 4.8 GB
md5:5891b80e17f3220739fe76ce81047282 1.3 GB
md5:fcd5a22d02c55537f76768da8e736f46 1.5 GB
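The MD5 checksums listed above can be used to confirm that an archive was downloaded intact. The sketch below uses only Python's standard `hashlib` module and reads the file in chunks, since the archives are several gigabytes in size. The function names are illustrative, and which checksum belongs to which archive has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For a downloaded archive, `matches_checksum(archive_path, "md5:...")` returns `True` when the local file matches the published digest; a `False` result usually means a truncated or corrupted download.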