Dataset Open Access

bioRxiv 10k

Daniel Ecer


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.3873702", 
  "language": "eng", 
  "title": "bioRxiv 10k", 
  "issued": {
    "date-parts": [
      [
        2020, 
        7, 
        23
      ]
    ]
  }, 
  "abstract": "<p>This dataset is a CC-BY 4.0 subset of what bioRxiv kindly made available: <a href=\"https://www.biorxiv.org/tdm\">https://www.biorxiv.org/tdm</a></p>\n\n<p>It is randomized and split into train (6,000), validation (2,000) and test (2,000) subsets - 10,000 PDF / XML pairs in total.</p>\n\n<p>The zip files further contain file lists of smaller subsets that used the subject area to potentially create a balanced subset.</p>\n\n<p>The zip is similar in structure to the &quot;PMC sample 1943&quot; dataset that was created as part of: <a href=\"https://doi.org/10.1145/2494266.2494271\">https://doi.org/10.1145/2494266.2494271</a> (a working link is available from: <a href=\"https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/\">https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/</a>).</p>\n\n<p>Therefore it is well suited for evaluation of <a href=\"https://github.com/elifesciences/sciencebeam/wiki/Related-Projects\">PDF to XML conversion tools</a>, such as <a href=\"https://github.com/kermitt2/grobid\">GROBID</a>. The dataset was created as part of <a href=\"https://elifesciences.org/\">eLife</a>&#39;s <a href=\"https://github.com/elifesciences/sciencebeam\">ScienceBeam</a> project.</p>", 
  "author": [
    {
      "family": "Ecer", 
      "given": "Daniel"
    }
  ], 
  "version": "1.0.0", 
  "type": "dataset", 
  "id": "3873702"
}
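
As a minimal sketch of how a record like the one above might be consumed, the Python below parses an abbreviated copy of the export, converts the `issued` date-parts into a date, and reads the split sizes out of the abstract text. The helper names (`issued_date`, `plain_abstract`, `split_sizes`) are hypothetical and are not part of Zenodo or ScienceBeam. The abstract is shortened to a single paragraph for illustration, and the regular expression assumes the "name (count)" wording used in this particular abstract.

```python
import json
import re
from datetime import date
from html import unescape

# Abbreviated copy of the CSL JSON export; the abstract is shortened for illustration.
CSL_RECORD = """
{
  "publisher": "Zenodo",
  "DOI": "10.5281/zenodo.3873702",
  "title": "bioRxiv 10k",
  "issued": {"date-parts": [[2020, 7, 23]]},
  "abstract": "<p>It is randomized and split into train (6,000), validation (2,000) and test (2,000) subsets - 10,000 PDF / XML pairs in total.</p>",
  "version": "1.0.0",
  "type": "dataset"
}
"""


def issued_date(record):
    """Convert the first CSL date-parts entry to a date (missing month/day default to 1)."""
    parts = record["issued"]["date-parts"][0]
    year, month, day = (list(parts) + [1, 1])[:3]
    return date(year, month, day)


def plain_abstract(record):
    """Strip HTML tags and decode entities; a rough cleanup, not a full HTML parser."""
    text = re.sub(r"<[^>]+>", " ", record["abstract"])
    return " ".join(unescape(text).split())


def split_sizes(abstract_text):
    """Extract 'train (6,000)'-style counts; assumes this abstract's exact wording."""
    pattern = r"(train|validation|test) \(([\d,]+)\)"
    return {
        name: int(count.replace(",", ""))
        for name, count in re.findall(pattern, abstract_text)
    }


record = json.loads(CSL_RECORD)
sizes = split_sizes(plain_abstract(record))
print(record["title"], issued_date(record).isoformat(), sizes)
```

Reading the counts from free text is fragile; the file lists shipped inside the zip files are the more reliable source for the actual split membership.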
                   All versions   This version
Views              131            131
Downloads          19,738         19,738
Data volume        262.6 TB       262.6 TB
Unique views       118            118
Unique downloads   1,590          1,590
