Published July 23, 2020 | Version 1.0.0
Dataset Open

bioRxiv 10k

  • 1. eLife Sciences


This dataset is a CC-BY 4.0 subset of what bioRxiv kindly made available:

The dataset is randomly shuffled and split into train (6,000), validation (2,000) and test (2,000) subsets, for 10,000 PDF / XML pairs in total.
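As an illustration of a split with these sizes, here is a minimal sketch that shuffles a list of document identifiers with a fixed seed and cuts it into three consecutive subsets. The function name `split_ids`, the seed, and the use of integer identifiers are assumptions made for the example; this is not the procedure actually used to build the dataset.

```python
import random


def split_ids(ids, sizes=(6000, 2000, 2000), seed=0):
    """Shuffle ids reproducibly and cut them into consecutive subsets.

    Hypothetical helper: the sizes mirror the train / validation / test
    counts quoted above, the seed is arbitrary.
    """
    ids = list(ids)
    if sum(sizes) != len(ids):
        raise ValueError("split sizes must add up to the number of ids")
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(ids)
    subsets = []
    start = 0
    for size in sizes:
        subsets.append(ids[start:start + size])
        start += size
    return subsets


# Placeholder identifiers 0..9999 stand in for the real document names.
train, validation, test = split_ids(range(10000))
```

Fixing the seed makes the split reproducible, which matters when several tools are compared on the same test subset.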

The zip files also contain file lists for smaller subsets, which were selected by subject area in an attempt to create more balanced samples.

The zip is similar in structure to the "PMC sample 1943" dataset that was created as part of: (a working link is available from:

Therefore it is well suited for evaluating PDF-to-XML conversion tools, such as GROBID. The dataset was created as part of eLife's ScienceBeam project.
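For evaluation, each PDF has to be matched with its reference XML. The sketch below lists such pairs inside one zip archive. It assumes that a PDF and its XML share the same path apart from the `.pdf` / `.xml` extension; that layout is not confirmed by this page, and the function name `list_pdf_xml_pairs` is hypothetical. Adjust the matching rule if the archives use different extensions or compressed members.

```python
import zipfile
from pathlib import PurePosixPath


def list_pdf_xml_pairs(zip_path):
    """Return sorted (pdf_name, xml_name) tuples found in a zip archive.

    Assumption: paired files differ only in their extension
    (.pdf versus .xml). Entries without a partner are skipped.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()

    by_stem = {}
    for name in names:
        path = PurePosixPath(name)
        suffix = path.suffix.lower()
        if suffix in (".pdf", ".xml"):
            # Key on the full path without extension so that files with the
            # same base name in different directories are kept apart.
            key = str(path.with_suffix(""))
            by_stem.setdefault(key, {})[suffix] = name

    return sorted(
        (entry[".pdf"], entry[".xml"])
        for entry in by_stem.values()
        if ".pdf" in entry and ".xml" in entry
    )
```

The returned pairs can then be fed to a conversion tool, comparing the tool's output for each PDF against the paired XML.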


Files (28.8 GB)

Size
5.7 GB
17.2 GB
5.8 GB

Additional details

Related works

Is derived from (URL)
Journal article: 10.1145/2494266.2494271 (DOI)


  • Constantin, A., Pettifer, S., Voronkov, A.: Fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the ACM Symposium on Document Engineering, pp. 177–180. ACM, New York (2013). doi:10.1145/2494266.2494271