Dataset Open Access

bioRxiv 10k

Daniel Ecer

This dataset is a CC-BY 4.0 subset of what bioRxiv kindly made available:

The data is randomized and split into train (6,000), validation (2,000) and test (2,000) subsets, for 10,000 PDF/XML pairs in total.
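A randomized split with these sizes can be sketched as follows. This is only an illustration, not the procedure used to build the dataset: the function name `split_dataset`, the seed, and the `doc-00000` style identifiers are hypothetical.

```python
import random


def split_dataset(ids, sizes=(6000, 2000, 2000), seed=42):
    """Shuffle ids and cut them into train/validation/test lists.

    The seed and the default sizes are illustrative assumptions; the
    dataset description only states the resulting subset sizes.
    """
    if sum(sizes) > len(ids):
        raise ValueError("not enough items for the requested split sizes")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    train_end = sizes[0]
    validation_end = train_end + sizes[1]
    test_end = validation_end + sizes[2]
    return {
        "train": shuffled[:train_end],
        "validation": shuffled[train_end:validation_end],
        "test": shuffled[validation_end:test_end],
    }


# Hypothetical identifiers standing in for the 10,000 PDF/XML pairs.
example_ids = [f"doc-{i:05d}" for i in range(10000)]
splits = split_dataset(example_ids)
print({name: len(items) for name, items in splits.items()})
```

Fixing the seed makes the split reproducible, so the same documents land in the same subset on every run.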

The zip files also contain file lists for smaller subsets, which were selected by subject area in an attempt to create balanced subsets.
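One way to picture a subject-balanced subset is to cap the number of documents taken from each subject area. The sketch below is an assumption about what "balanced" could mean here, not the method used for the bundled file lists; `balanced_subset`, the `(doc_id, subject)` record shape, and the sample records are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(records, per_subject, seed=0):
    """Pick at most per_subject documents from each subject area.

    records is an iterable of (doc_id, subject) pairs. Subjects with
    fewer than per_subject documents contribute everything they have.
    """
    by_subject = defaultdict(list)
    for doc_id, subject in records:
        by_subject[subject].append(doc_id)

    rng = random.Random(seed)
    selected = []
    # Iterate subjects in sorted order so the result is reproducible.
    for subject in sorted(by_subject):
        docs = by_subject[subject]
        rng.shuffle(docs)
        selected.extend(docs[:per_subject])
    return selected


# Hypothetical records; real subject areas would come from the article metadata.
records = [
    ("a1", "Neuroscience"), ("a2", "Neuroscience"), ("a3", "Neuroscience"),
    ("b1", "Genomics"), ("b2", "Genomics"),
    ("c1", "Ecology"),
]
subset = balanced_subset(records, per_subject=2)
print(len(subset))
```

Capping per subject keeps large subject areas from dominating the subset, at the cost of small areas remaining under-represented when they hold fewer documents than the cap.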

The zip is similar in structure to the "PMC sample 1943" dataset that was created as part of: (a working link is available from:

Therefore it is well suited for evaluation of PDF to XML conversion tools, such as GROBID. The dataset was created as part of eLife's ScienceBeam project.

Files (28.8 GB)
Name Size
5.7 GB
17.2 GB
5.8 GB
Constantin, A., Pettifer, S., Voronkov, A.: Fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the ACM Symposium on Document Engineering, pp. 177–180. ACM, New York (2013). doi: 10.1145/2494266.2494271
