Dataset Open Access

unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata

Saier, Tarek; Färber, Michael


JSON-LD (schema.org) Export

{
  "description": "<p>In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios depend heavily on the data set used. However, existing scholarly data sets are limited in several regards.</p>\n\n<p>Here, we propose a new <strong>data set based on all publications from all scientific disciplines available on arXiv.org</strong>. Apart from providing the <strong>papers&#39; plain text</strong>, <strong>in-text citations were annotated</strong> via global identifiers. Furthermore, citing and cited publications were linked to the <strong>Microsoft Academic Graph</strong>, providing access to rich metadata. Our data set consists of <strong>over one million documents and 29.2 million citation contexts</strong>. The data set, which is made freely available for research purposes, can not only enhance the future evaluation of research paper-based and citation context-based approaches but also serve as a basis for new ways to analyze in-text citations.</p>\n\n<p>This <strong>updated version</strong> (v3) of our data set is based on all arXiv publications until 2020-07-31 and on the Microsoft Academic Graph as of 2020-08-18. As an additional contribution, we include a table with the publication date and the scientific discipline of each paper for easier filtering.</p>\n\n<p>See <a href=\"https://github.com/IllDepence/unarXive\">https://github.com/IllDepence/unarXive</a> for the <strong>source code</strong> used to create the data set.</p>\n\n<p><strong>Usage examples</strong> for our data set are provided at <a href=\"https://github.com/IllDepence/unarXive#usage-examples\">https://github.com/IllDepence/unarXive#usage-examples</a>.</p>\n\n<p>To <strong>cite</strong> our data set, and for further information, please refer to our journal article:</p>\n\n<p><em>Tarek Saier, Michael F&auml;rber: &quot;<a href=\"https://www.aifb.kit.edu/images/f/f9/UnarXive_Scientometrics2020.pdf\">unarXive: A Large Scholarly Data Set with Publications&rsquo; Full-Text, Annotated In-Text Citations, and Links to Metadata</a>&quot;, Scientometrics, 2020, <a href=\"http://dx.doi.org/10.1007/s11192-020-03382-z\">http://dx.doi.org/10.1007/s11192-020-03382-z</a>.</em></p>",
  "license": "", 
  "creator": [
    {
      "affiliation": "University of Freiburg", 
      "@id": "https://orcid.org/0000-0001-5028-0109", 
      "@type": "Person", 
      "name": "Saier, Tarek"
    }, 
    {
      "affiliation": "University of Freiburg", 
      "@id": "https://orcid.org/0000-0001-5458-8645", 
      "@type": "Person", 
      "name": "F\u00e4rber, Michael"
    }
  ], 
  "url": "https://zenodo.org/record/4313164", 
  "datePublished": "2020-12-09", 
  "keywords": [
    "scholarly data", 
    "citations", 
    "papers", 
    "arXiv.org", 
    "digital libraries", 
    "dataset", 
    "scientometrics", 
    "full-text"
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/efa917d9-ab49-4256-a452-78ee459401fd/unarXive-2020.tar.bz2", 
      "encodingFormat": "bz2", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.4313164", 
  "@id": "https://doi.org/10.5281/zenodo.4313164", 
  "@type": "Dataset", 
  "name": "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata"
}
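
The record above is a schema.org Dataset description, so its fields can be read with any JSON parser. The following is a minimal sketch, not part of the original record: it embeds a trimmed copy of the export (only the fields it uses) and a hypothetical helper, summarize, that extracts the DOI, the creator names, and the download URLs. The helper name and the choice of fields are assumptions made for illustration.

```python
import json

# Trimmed copy of the JSON-LD export above; only the fields used below are kept.
RECORD = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "@id": "https://doi.org/10.5281/zenodo.4313164",
  "datePublished": "2020-12-09",
  "creator": [
    {"@type": "Person", "name": "Saier, Tarek"},
    {"@type": "Person", "name": "F\\u00e4rber, Michael"}
  ],
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "bz2",
      "contentUrl": "https://zenodo.org/api/files/efa917d9-ab49-4256-a452-78ee459401fd/unarXive-2020.tar.bz2"
    }
  ]
}
"""


def summarize(record_text):
    """Hypothetical helper: collect the DOI, creators, and download URLs."""
    record = json.loads(record_text)
    return {
        "doi": record["@id"],
        "creators": [c["name"] for c in record.get("creator", [])],
        # Keep only entries explicitly typed as downloadable files.
        "downloads": [
            d["contentUrl"]
            for d in record.get("distribution", [])
            if d.get("@type") == "DataDownload"
        ],
    }


summary = summarize(RECORD)
print(summary["doi"])
for url in summary["downloads"]:
    print(url)
```

The sketch only parses metadata; it does not download the archive, which according to the statistics on this page is several gigabytes per copy.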
Statistic          All versions   This version
Views              3,549          800
Downloads          24,220         450
Data volume        494.8 TB       8.6 TB
Unique views       2,842          700
Unique downloads   3,086          278
