Dataset Open Access

bioRxiv 10k

Daniel Ecer


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/02200c1c-2470-4ae9-80ac-47826ae19dde/biorxiv-10k-test-2000.zip"
      }, 
      "checksum": "md5:942cb97541b82440e74409e19e17d94d", 
      "bucket": "02200c1c-2470-4ae9-80ac-47826ae19dde", 
      "key": "biorxiv-10k-test-2000.zip", 
      "type": "zip", 
      "size": 5738592606
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/02200c1c-2470-4ae9-80ac-47826ae19dde/biorxiv-10k-train-6000.zip"
      }, 
      "checksum": "md5:e4fefd52a2d480951d8514360395c34a", 
      "bucket": "02200c1c-2470-4ae9-80ac-47826ae19dde", 
      "key": "biorxiv-10k-train-6000.zip", 
      "type": "zip", 
      "size": 17233894299
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/02200c1c-2470-4ae9-80ac-47826ae19dde/biorxiv-10k-validation-2000.zip"
      }, 
      "checksum": "md5:83dbf0d9ee7da1617afc319a38fc07a7", 
      "bucket": "02200c1c-2470-4ae9-80ac-47826ae19dde", 
      "key": "biorxiv-10k-validation-2000.zip", 
      "type": "zip", 
      "size": 5811669057
    }
  ], 
  "owners": [
    104569
  ], 
  "doi": "10.5281/zenodo.3873702", 
  "stats": {
    "version_unique_downloads": 1590.0, 
    "unique_views": 118.0, 
    "views": 131.0, 
    "version_views": 131.0, 
    "unique_downloads": 1590.0, 
    "version_unique_views": 118.0, 
    "volume": 262621863634467.0, 
    "version_downloads": 19738.0, 
    "downloads": 19738.0, 
    "version_volume": 262621863634467.0
  }, 
  "links": {
    "doi": "https://doi.org/10.5281/zenodo.3873702", 
    "conceptdoi": "https://doi.org/10.5281/zenodo.3873701", 
    "bucket": "https://zenodo.org/api/files/02200c1c-2470-4ae9-80ac-47826ae19dde", 
    "conceptbadge": "https://zenodo.org/badge/doi/10.5281/zenodo.3873701.svg", 
    "html": "https://zenodo.org/record/3873702", 
    "latest_html": "https://zenodo.org/record/3873702", 
    "badge": "https://zenodo.org/badge/doi/10.5281/zenodo.3873702.svg", 
    "latest": "https://zenodo.org/api/records/3873702"
  }, 
  "conceptdoi": "10.5281/zenodo.3873701", 
  "created": "2020-07-23T15:34:44.157708+00:00", 
  "updated": "2020-07-24T00:59:24.742598+00:00", 
  "conceptrecid": "3873701", 
  "revision": 2, 
  "id": 3873702, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5281/zenodo.3873702", 
    "description": "<p>This dataset is a CC-BY 4.0 subset of what bioRxiv kindly made available: <a href=\"https://www.biorxiv.org/tdm\">https://www.biorxiv.org/tdm</a></p>\n\n<p>It is randomized and split into train (6,000), validation (2,000) and test (2,000) subsets - 10,000 PDF / XML pairs in total.</p>\n\n<p>The zip files further contain file lists of smaller subsets that used the subject area to potentially create a balanced subset.</p>\n\n<p>The zip is similar in structure to the &quot;PMC sample 1943&quot; dataset that was created as part of: <a href=\"https://doi.org/10.1145/2494266.2494271\">https://doi.org/10.1145/2494266.2494271</a> (a working link is available from: <a href=\"https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/\">https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/</a>).</p>\n\n<p>Therefore it is well suited for evaluation of <a href=\"https://github.com/elifesciences/sciencebeam/wiki/Related-Projects\">PDF to XML conversion tools</a>, such as <a href=\"https://github.com/kermitt2/grobid\">GROBID</a>. The dataset was created as part of <a href=\"https://elifesciences.org/\">eLife</a>&#39;s <a href=\"https://github.com/elifesciences/sciencebeam\">ScienceBeam</a> project.</p>", 
    "language": "eng", 
    "title": "bioRxiv 10k", 
    "license": {
      "id": "CC-BY-4.0"
    }, 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "3873701"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "3873702"
          }
        }
      ]
    }, 
    "version": "1.0.0", 
    "references": [
      "Constantin, A., Steve, P., Andrei, V.: Fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the ACM Symposium on Document Engineering, pp. 177\u2013180. ACM, New York (2013). doi: 10.1145/2494266.2494271"
    ], 
    "keywords": [
      "bioRxiv", 
      "PDF", 
      "XML", 
      "JATS"
    ], 
    "publication_date": "2020-07-23", 
    "creators": [
      {
        "orcid": "0000-0003-0320-4300", 
        "name": "Daniel Ecer"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }, 
    "related_identifiers": [
      {
        "scheme": "url", 
        "identifier": "http://biorxiv.org/tdm", 
        "relation": "isDerivedFrom"
      }, 
      {
        "scheme": "doi", 
        "identifier": "10.1145/2494266.2494271", 
        "relation": "references", 
        "resource_type": "publication-article"
      }, 
      {
        "scheme": "doi", 
        "identifier": "10.5281/zenodo.3873701", 
        "relation": "isVersionOf"
      }
    ]
  }
}
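Each entry under "files" carries a "checksum" of the form "md5:<hex>" and a "size" in bytes, so a downloaded archive can be checked against the record before use. The following is a minimal sketch of that check, not an official Zenodo client: the function names (parse_checksum, file_digest, verify_download) are hypothetical, the FILES list simply copies the three entries from the export above, and the archives are assumed to have already been downloaded into the current directory.

```python
import hashlib
import os

# Copied from the "files" array of the JSON export above.
FILES = [
    {"key": "biorxiv-10k-test-2000.zip",
     "checksum": "md5:942cb97541b82440e74409e19e17d94d",
     "size": 5738592606},
    {"key": "biorxiv-10k-train-6000.zip",
     "checksum": "md5:e4fefd52a2d480951d8514360395c34a",
     "size": 17233894299},
    {"key": "biorxiv-10k-validation-2000.zip",
     "checksum": "md5:83dbf0d9ee7da1617afc319a38fc07a7",
     "size": 5811669057},
]


def parse_checksum(checksum):
    """Split a value such as 'md5:abc...' into (algorithm, hex digest)."""
    algorithm, _, hex_digest = checksum.partition(":")
    if not algorithm or not hex_digest:
        raise ValueError(f"unexpected checksum format: {checksum!r}")
    return algorithm, hex_digest.lower()


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte archives fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, entry):
    """Return True if the file at `path` matches the entry's size and checksum."""
    if os.path.getsize(path) != entry["size"]:
        return False  # cheap check first: a truncated download fails here
    algorithm, expected = parse_checksum(entry["checksum"])
    return file_digest(path, algorithm) == expected


if __name__ == "__main__":
    total = sum(entry["size"] for entry in FILES)
    print(f"total size: {total} bytes ({total / 1e9:.2f} GB)")
    for entry in FILES:
        # Assumption: archives were saved under their "key" name in the cwd.
        if os.path.exists(entry["key"]):
            status = "ok" if verify_download(entry["key"], entry) else "MISMATCH"
        else:
            status = "not downloaded"
        print(f"{entry['key']}: {status}")
```

The size comparison runs before hashing because it is almost free and catches interrupted downloads without reading the full archive.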
                  All versions   This version
Views             131            131
Downloads         19,738         19,738
Data volume       262.6 TB       262.6 TB
Unique views      118            118
Unique downloads  1,590          1,590
