Published July 23, 2020 | Version 1.0.0
Dataset Open

bioRxiv 10k

  • 1. eLife Sciences

Description

This dataset is a CC-BY 4.0 subset of what bioRxiv kindly made available: https://www.biorxiv.org/tdm

The documents were shuffled randomly and split into train (6,000), validation (2,000) and test (2,000) subsets, giving 10,000 PDF/XML pairs in total.
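A randomized split with these subset sizes can be sketched as follows. This is only an illustration and not the procedure that produced this dataset: the function name `split_ids`, the seed, and the placeholder document identifiers are all assumptions.

```python
import random


def split_ids(ids, sizes=(6000, 2000, 2000), seed=0):
    """Shuffle ids reproducibly and cut them into consecutive subsets."""
    if sum(sizes) > len(ids):
        raise ValueError("not enough items for the requested split sizes")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(ids)
    # A dedicated Random instance keeps the shuffle reproducible (seed is arbitrary).
    random.Random(seed).shuffle(shuffled)
    subsets, start = [], 0
    for size in sizes:
        subsets.append(shuffled[start:start + size])
        start += size
    return subsets


# Placeholder identifiers; the real dataset uses its own file names.
document_ids = [f"doc-{i:05d}" for i in range(10_000)]
train, validation, test = split_ids(document_ids)
print(len(train), len(validation), len(test))  # 6000 2000 2000
```

Fixing the seed makes the split repeatable, which matters when results on the validation and test subsets are compared across tools.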

Each zip file also contains file lists for smaller subsets. These were selected by subject area in an attempt to make the subsets more balanced.

The zip is similar in structure to the "PMC sample 1943" dataset that was created as part of: https://doi.org/10.1145/2494266.2494271 (a working link is available from: https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/).

It is therefore well suited to evaluating PDF-to-XML conversion tools such as GROBID. The dataset was created as part of eLife's ScienceBeam project.
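Before running an evaluation, it can help to enumerate the PDF/XML pairs in a downloaded archive. The sketch below is a minimal example that makes assumptions about the layout: it assumes each PDF sits next to an XML file with the same base name, and that the XML files end in `.xml` or `.nxml`. The internal structure is not described on this page, so check it against the actual archive. The function name `list_pdf_xml_pairs` is hypothetical.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def list_pdf_xml_pairs(zip_path, xml_suffixes=(".xml", ".nxml")):
    """Return (pdf_member, xml_member) pairs sharing a directory and base name.

    Assumption: a PDF and its XML differ only by file extension.
    """
    by_key = defaultdict(dict)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)
            suffix = path.suffix.lower()
            if suffix == ".pdf":
                kind = "pdf"
            elif suffix in xml_suffixes:
                kind = "xml"
            else:
                continue  # directories, file lists, and anything else
            # Directory plus base name identifies one document.
            by_key[str(path.with_suffix(""))][kind] = name
    return sorted(
        (entry["pdf"], entry["xml"])
        for entry in by_key.values()
        if "pdf" in entry and "xml" in entry
    )
```

Calling `list_pdf_xml_pairs("biorxiv-10k-test-2000.zip")` on a local copy of the test archive should return one tuple per document if the layout assumption holds. A count that differs from the expected subset size suggests the assumption needs adjusting.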

Files

biorxiv-10k-test-2000.zip

Files (28.8 GB)

Checksum (size):
md5:942cb97541b82440e74409e19e17d94d (5.7 GB)
md5:e4fefd52a2d480951d8514360395c34a (17.2 GB)
md5:83dbf0d9ee7da1617afc319a38fc07a7 (5.8 GB)
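Since the archives are several gigabytes each, it is worth checking a download against its listed MD5 checksum. The following is a minimal sketch using Python's standard `hashlib`. The helper names are hypothetical, and the listing above does not show which checksum belongs to which file name, so the path and checksum pairing in the usage comment is an assumption.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Example (assumed pairing of file name and checksum; verify on the record page):
# matches_listed_md5("biorxiv-10k-test-2000.zip",
#                    "md5:942cb97541b82440e74409e19e17d94d")
```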

Additional details

Related works

Is derived from
http://biorxiv.org/tdm (URL)
References
Journal article: 10.1145/2494266.2494271 (DOI)

References

  • Constantin, A., Pettifer, S., Voronkov, A.: Fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the ACM Symposium on Document Engineering, pp. 177–180. ACM, New York (2013). doi: 10.1145/2494266.2494271