Saier, Tarek
Färber, Michael
2020-12-09
<h2><strong>Description</strong></h2><p><strong>unarXive</strong> is a scholarly data set containing <strong>publications' full-text</strong>, annotated <strong>in-text citations</strong>, and a <strong>citation network</strong>.</p><p>The data is <strong>generated from all LaTeX sources on </strong><a href="https://arxiv.org/"><strong>arXiv</strong></a> and is therefore of higher quality than data generated from PDF files.</p><p>Typical <strong>use cases</strong> are:</p><ul><li>Citation recommendation</li><li>Citation context analysis</li><li>Bibliographic analyses</li><li>Reference string parsing</li></ul><p>This version (v3) of our data set is based on all arXiv publications until 2020-07-31 and on the Microsoft Academic Graph as of 2020-08-18. As an additional contribution, we included a table with the publication date and the scientific discipline of each paper for easier filtering.</p><p><strong>Note:</strong> This Zenodo record is an old version of unarXive. You can find the most recent version at <a href="https://zenodo.org/record/7752754">https://zenodo.org/record/7752754</a> and <a href="https://zenodo.org/record/7752615">https://zenodo.org/record/7752615</a>.</p><h2><strong>Access</strong></h2><p>┏━━━━━━━━━━━━━━━━━━━━━━━━━━┓<br>┃ <a href="https://github.com/IllDepence/unarXive/blob/legacy_2020/doc/unarXive_sample.tar.bz2"><strong>D O W N L O A D S A M P L E</strong></a> ┃<br>┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛</p><p>To download the whole data set, send an access request and note the following:</p><blockquote><p><strong>Note:</strong> This Zenodo record is the "full" version of unarXive, which was generated from all of arXiv.org <i>including non-permissively licensed papers</i>. 
Make sure that your use of the data is compliant with the papers' licensing terms.¹</p><p>¹ For information on papers' licenses, use <a href="https://info.arxiv.org/help/bulk_data/index.html">arXiv's bulk metadata access</a>.</p></blockquote><p>The <strong>code</strong> used for generating the data set is <a href="https://github.com/IllDepence/unarXive/tree/legacy_2020/">publicly available</a>.</p><p><strong>Usage examples</strong> for our data set are provided <a href="https://github.com/IllDepence/unarXive/tree/legacy_2020/#usage-examples">here on GitHub</a>.</p><h2><strong>Citing</strong></h2><p>The initial version of unarXive is described in the following journal article.</p><p><i>Tarek Saier, Michael Färber: "</i><a href="http://dx.doi.org/10.1007/s11192-020-03382-z"><i>unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata</i></a><i>", Scientometrics, 2020.</i><br>[<a href="https://www.aifb.kit.edu/images/f/f9/UnarXive_Scientometrics2020.pdf">link to an author copy</a>]</p><p>The <strong>updated version</strong> is described in the following conference paper.</p><p><i>Tarek Saier, Michael Färber: "</i><a href="https://doi.org/10.1109/JCDL57899.2023.00020"><i>unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network</i></a><i>", JCDL 2023.</i><br>[<a href="https://doi.org/10.48550/arXiv.2303.14957">link to an author copy</a>]</p>
https://doi.org/10.5281/zenodo.4313164
oai:zenodo.org:4313164
Zenodo
https://doi.org/10.1007/s11192-020-03382-z
https://doi.org/10.5281/zenodo.7752754
https://zenodo.org/communities/natural-language-processing
https://zenodo.org/communities/bibliometrics
https://zenodo.org/communities/scholarly-data
https://doi.org/10.5281/zenodo.2553522
info:eu-repo/semantics/restrictedAccess
Other (Attribution)
scholarly data
citations
papers
arXiv.org
digital libraries
dataset
scientometrics
full-text
unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata
info:eu-repo/semantics/other