bioRxiv 10k

Published July 23, 2020 | Version 1.0.0

Dataset Open

This dataset is a CC-BY 4.0 licensed subset of the content that bioRxiv kindly makes available for text and data mining: https://www.biorxiv.org/tdm

It is randomized and split into train (6,000), validation (2,000) and test (2,000) subsets, giving 10,000 PDF/XML pairs in total.
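As an illustration of what a randomized 6,000 / 2,000 / 2,000 split looks like, here is a minimal sketch. It is not the procedure used to build this dataset: the function name `split_ids`, the seed and the document identifiers are all hypothetical, and the real split is defined by the contents of the three zip files.

```python
import random


def split_ids(ids, sizes=(6000, 2000, 2000), seed=0):
    """Shuffle ids with a fixed seed and cut them into train/validation/test.

    Hypothetical helper for illustration only; the seed and the
    identifiers are assumptions, not the dataset's actual split.
    """
    ids = list(ids)
    if sum(sizes) != len(ids):
        raise ValueError("split sizes must add up to the number of documents")
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(ids)
    n_train, n_validation, _ = sizes
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_validation],
        "test": ids[n_train + n_validation:],
    }
```

Fixing the seed makes the split reproducible, which matters when results on the test subset are compared across tools.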

The zip files also contain file lists for smaller subsets, which were selected by subject area with the aim of creating balanced subsets.

The zip is similar in structure to the "PMC sample 1943" dataset that was created as part of: https://doi.org/10.1145/2494266.2494271 (a working link is available from: https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/).

It is therefore well suited to evaluating PDF-to-XML conversion tools such as GROBID. The dataset was created as part of eLife's ScienceBeam project.
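An evaluation run typically starts by collecting the PDF/XML pairs from one of the zip files. The sketch below assumes that the two files of a pair share the same path apart from their extension, and that the XML uses a `.xml` or `.nxml` extension. Neither assumption is confirmed by this description, so check the archive layout before relying on it; `find_pairs` is a hypothetical helper name.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed extensions; adjust after inspecting the actual archive.
PDF_SUFFIXES = (".pdf",)
XML_SUFFIXES = (".xml", ".nxml")


def find_pairs(zip_path):
    """Return sorted (pdf_name, xml_name) tuples for entries sharing a stem."""
    by_stem = defaultdict(dict)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)
            suffix = path.suffix.lower()
            if suffix in PDF_SUFFIXES:
                kind = "pdf"
            elif suffix in XML_SUFFIXES:
                kind = "xml"
            else:
                continue  # directories, file lists and other entries
            # Key on the full path without its extension.
            by_stem[str(path.with_suffix(""))][kind] = name
    return sorted(
        (files["pdf"], files["xml"])
        for files in by_stem.values()
        if "pdf" in files and "xml" in files
    )
```

Only entry names are read here, so the archive does not have to be extracted to check that each subset contains the expected number of pairs.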

Files

Name	MD5	Size
biorxiv-10k-test-2000.zip	942cb97541b82440e74409e19e17d94d	5.7 GB
biorxiv-10k-train-6000.zip	e4fefd52a2d480951d8514360395c34a	17.2 GB
biorxiv-10k-validation-2000.zip	83dbf0d9ee7da1617afc319a38fc07a7	5.8 GB
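
Because the archives are several gigabytes each, it is worth checking a download against the MD5 checksums listed above before unpacking it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are illustrative, and the local file path is whatever location you downloaded to.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "biorxiv-10k-test-2000.zip": "942cb97541b82440e74409e19e17d94d",
    "biorxiv-10k-train-6000.zip": "e4fefd52a2d480951d8514360395c34a",
    "biorxiv-10k-validation-2000.zip": "83dbf0d9ee7da1617afc319a38fc07a7",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at path matches the listed checksum for name."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download("biorxiv-10k-test-2000.zip", "biorxiv-10k-test-2000.zip")` returns `True` only if the downloaded file is byte-for-byte identical to the published archive.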

Constantin, A., Pettifer, S., Voronkov, A.: PDFX: fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the ACM Symposium on Document Engineering, pp. 177–180. ACM, New York (2013). doi: 10.1145/2494266.2494271