Dataset Open Access

bioRxiv 10k

Daniel Ecer


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "eng", 
    "@type": "Language", 
    "name": "English"
  }, 
  "description": "<p>This dataset is a CC-BY 4.0 subset of what bioRxiv kindly made available: <a href=\"https://www.biorxiv.org/tdm\">https://www.biorxiv.org/tdm</a></p>\n\n<p>It is randomized and split into train (6,000), validation (2,000) and test (2,000) subsets - 10,000 PDF / XML pairs in total.</p>\n\n<p>The zip files further contain file lists of smaller subsets that used the subject area to potentially create a balanced subset.</p>\n\n<p>The zip is similar in structure to the &quot;PMC sample 1943&quot; dataset that was created as part of: <a href=\"https://doi.org/10.1145/2494266.2494271\">https://doi.org/10.1145/2494266.2494271</a> (a working link is available from: <a href=\"https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/\">https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/</a>).</p>\n\n<p>Therefore it is well suited for evaluation of <a href=\"https://github.com/elifesciences/sciencebeam/wiki/Related-Projects\">PDF to XML conversion tools</a>, such as <a href=\"https://github.com/kermitt2/grobid\">GROBID</a>. The dataset was created as part of <a href=\"https://elifesciences.org/\">eLife</a>&#39;s <a href=\"https://github.com/elifesciences/sciencebeam\">ScienceBeam</a> project.</p>", 
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "@id": "https://orcid.org/0000-0003-0320-4300", 
      "@type": "Person", 
      "name": "Daniel Ecer"
    }
  ], 
  "url": "https://zenodo.org/record/3873702", 
  "citation": [
    {
      "@id": "https://doi.org/10.1145/2494266.2494271", 
      "@type": "ScholarlyArticle"
    }
  ], 
  "datePublished": "2020-07-23", 
  "version": "1.0.0", 
  "keywords": [
    "bioRxiv", 
    "PDF", 
    "XML", 
    "JATS"
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/02200c1c-2470-4ae9-80ac-47826ae19dde/biorxiv-10k-test-2000.zip", 
      "encodingFormat": "zip", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/02200c1c-2470-4ae9-80ac-47826ae19dde/biorxiv-10k-train-6000.zip", 
      "encodingFormat": "zip", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/02200c1c-2470-4ae9-80ac-47826ae19dde/biorxiv-10k-validation-2000.zip", 
      "encodingFormat": "zip", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.3873702", 
  "@id": "https://doi.org/10.5281/zenodo.3873702", 
  "@type": "Dataset", 
  "name": "bioRxiv 10k"
}
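As an illustration of how the record above might be consumed, the following minimal sketch parses a trimmed copy of the JSON-LD and lists the download URL for each split. The helper name `list_splits` is hypothetical, and the file name pattern `biorxiv-10k-<split>-<count>.zip` is an assumption inferred from the three `contentUrl` values rather than a documented convention.

```python
import json
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Trimmed copy of the record above; only the fields used here are kept.
BASE = "https://zenodo.org/api/files/02200c1c-2470-4ae9-80ac-47826ae19dde"
RECORD = json.loads(
    """
{
  "@type": "Dataset",
  "name": "bioRxiv 10k",
  "distribution": [
    {"contentUrl": "BASE/biorxiv-10k-test-2000.zip",
     "encodingFormat": "zip", "@type": "DataDownload"},
    {"contentUrl": "BASE/biorxiv-10k-train-6000.zip",
     "encodingFormat": "zip", "@type": "DataDownload"},
    {"contentUrl": "BASE/biorxiv-10k-validation-2000.zip",
     "encodingFormat": "zip", "@type": "DataDownload"}
  ]
}
""".replace("BASE", BASE)
)

# Assumption: file names follow "biorxiv-10k-<split>-<count>.zip".
FILENAME_PATTERN = re.compile(r"biorxiv-10k-(?P<split>[a-z]+)-(?P<count>\d+)\.zip$")


def list_splits(record):
    """Map split name -> (declared pair count, download URL)."""
    splits = {}
    for item in record.get("distribution", []):
        if item.get("@type") != "DataDownload":
            continue  # ignore anything that is not a downloadable file
        url = item["contentUrl"]
        filename = PurePosixPath(urlparse(url).path).name
        match = FILENAME_PATTERN.match(filename)
        if match:
            splits[match.group("split")] = (int(match.group("count")), url)
    return splits


splits = list_splits(RECORD)
total = sum(count for count, _ in splits.values())

for name, (count, url) in sorted(splits.items()):
    print(f"{name}: {count} PDF/XML pairs, {url}")
print(f"total: {total}")
```

The counts are read from the file names only; they are not verified against the contents of the zip files, so treat them as declared sizes that should match the 6,000 / 2,000 / 2,000 split stated in the description.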
                  All versions   This version
Views             131            131
Downloads         19,738         19,738
Data volume       262.6 TB       262.6 TB
Unique views      118            118
Unique downloads  1,590          1,590
