Dataset Open Access

bioRxiv 10k

Daniel Ecer

Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="" xmlns:oai_dc="" xmlns:xsi="" xsi:schemaLocation="">
  <dc:creator>Daniel Ecer</dc:creator>
  <dc:description>This dataset is a CC-BY 4.0 subset of what bioRxiv kindly made available:

It is randomized and split into train (6,000), validation (2,000) and test (2,000) subsets, for 10,000 PDF/XML pairs in total.

The zip files also contain file lists for smaller subsets, which were selected by subject area in an attempt to make them balanced.

The zip is similar in structure to the "PMC sample 1943" dataset that was created as part of: (a working link is available from:

It is therefore well suited to evaluating PDF-to-XML conversion tools such as GROBID. The dataset was created as part of eLife's ScienceBeam project.</dc:description>
  <dc:title>bioRxiv 10k</dc:title>
</oai_dc:dc>
Statistics (all versions / this version)
Views: 131 / 131
Downloads: 19,738 / 19,738
Data volume: 262.6 TB / 262.6 TB
Unique views: 118 / 118
Unique downloads: 1,590 / 1,590

