Dataset Open Access

bioRxiv 10k

Daniel Ecer

This dataset is a subset, limited to CC-BY 4.0 licensed documents, of the content that bioRxiv kindly made available for text and data mining: https://www.biorxiv.org/tdm

The documents are in random order and split into train (6,000), validation (2,000) and test (2,000) subsets, for 10,000 PDF / XML pairs in total.

The zip files also contain file lists for smaller subsets. These lists were selected using the subject area, with the aim of making the smaller subsets more balanced.

The zip is similar in structure to the "PMC sample 1943" dataset that was created as part of: https://doi.org/10.1145/2494266.2494271 (a working link is available from: https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/).

Therefore it is well suited for evaluation of PDF to XML conversion tools, such as GROBID. The dataset was created as part of eLife's ScienceBeam project.
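
As a rough illustration of how the PDF / XML pairs could be enumerated before running an evaluation, the following Python sketch lists the pairs inside one of the zip files. It assumes, without this being confirmed by the description above, that each PDF and its XML share the same path apart from the file extension (.pdf and .xml). The function names are hypothetical and would need adjusting if the actual archive layout differs.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def find_pdf_xml_pairs(names):
    """Return sorted (pdf_name, xml_name) tuples for entries sharing a path stem.

    Assumption: a PDF and its XML differ only in their extension.
    """
    by_stem = defaultdict(dict)
    for name in names:
        path = PurePosixPath(name)
        suffix = path.suffix.lower()
        if suffix in (".pdf", ".xml"):
            # Key on the path without its extension, e.g. "doc1/doc1".
            by_stem[str(path.with_suffix(""))][suffix] = name
    return sorted(
        (files[".pdf"], files[".xml"])
        for files in by_stem.values()
        if ".pdf" in files and ".xml" in files
    )


def list_pairs_in_zip(zip_path):
    """Read only the zip's table of contents; nothing is extracted."""
    with zipfile.ZipFile(zip_path) as archive:
        return find_pdf_xml_pairs(archive.namelist())
```

Calling list_pairs_in_zip("biorxiv-10k-test-2000.zip") should then return 2,000 pairs if the layout assumption holds; a different count would suggest the archive is organised differently.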

Files (28.8 GB)

Name                              Size      MD5
biorxiv-10k-test-2000.zip         5.7 GB    942cb97541b82440e74409e19e17d94d
biorxiv-10k-train-6000.zip        17.2 GB   e4fefd52a2d480951d8514360395c34a
biorxiv-10k-validation-2000.zip   5.8 GB    83dbf0d9ee7da1617afc319a38fc07a7
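
Given the size of the files, it may be worth checking a download against the listed MD5 checksums. The following Python sketch shows one way to do that using only the standard library. The function names and the chunk size are illustrative choices, and the files are assumed to be in the given directory under their original names.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "biorxiv-10k-test-2000.zip": "942cb97541b82440e74409e19e17d94d",
    "biorxiv-10k-train-6000.zip": "e4fefd52a2d480951d8514360395c34a",
    "biorxiv-10k-validation-2000.zip": "83dbf0d9ee7da1617afc319a38fc07a7",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest in chunks, so large files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {file name: True/False} for each listed file found in the directory."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
    return results
```

Calling verify_downloads() in the download directory returns a dictionary with True for each file whose checksum matches; files that are not present are skipped.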
  • Constantin, A., Pettifer, S., Voronkov, A.: PDFX: fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the ACM Symposium on Document Engineering, pp. 177–180. ACM, New York (2013). doi: 10.1145/2494266.2494271
