unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network (full)
Description
unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network.
The data is generated from all LaTeX sources on arXiv and is therefore of higher quality than data generated from PDF files.
Typical uses are:
- Training of ML models (citation recommendation, summarization, LLMs)
- Citation context analysis
- Bibliographic analyses
Access
A sample of the data set is available for download.
To download the whole data set, send an access request and note the following:
Note: this Zenodo record is the "full" version of unarXive, which was generated from all of arXiv.org, including non-permissively licensed papers. Make sure that your use of the data complies with the papers' licensing terms.¹
Alternatively, you can use the unarXive open subset.
¹ For information on papers' licenses, use arXiv's bulk metadata access.
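As a rough illustration of checking papers' licenses against arXiv's metadata, the sketch below filters metadata records by license. It assumes the metadata is available as JSON lines with one record per paper, and that each record carries "id" and "license" keys; both the layout and the key names are assumptions to verify against the metadata you actually download. Which licenses count as permissive is a policy decision, so the set used here is only an example, and the sample records are made up.

```python
import json
from typing import Iterable, Iterator

# Example set of license URLs treated as permissive (an assumption;
# adjust it to your own legal requirements).
PERMISSIVE_LICENSES = {
    "http://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-sa/4.0/",
    "http://creativecommons.org/publicdomain/zero/1.0/",
}


def permissive_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield the IDs of records whose license is in PERMISSIVE_LICENSES."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: each record has "id" and "license" keys;
        # a missing or null license is treated as non-permissive.
        if record.get("license") in PERMISSIVE_LICENSES:
            yield record["id"]


# Made-up sample records, for illustration only.
sample = [
    '{"id": "0000.00001", "license": "http://creativecommons.org/licenses/by/4.0/"}',
    '{"id": "0000.00002", "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"}',
    '{"id": "0000.00003", "license": null}',
]
result = list(permissive_ids(sample))
print(result)
```

To run this on a real metadata file, pass an open file handle in place of `sample`; the function reads it line by line, so the whole file never has to fit in memory.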
The code for generating the data set is publicly available.
Additional details
Related works
- Is described by
- Conference paper: 10.1109/JCDL57899.2023.00020 (DOI)
- Is new version of
- Dataset: 10.5281/zenodo.2553522 (DOI)
Subjects
- Natural Language Processing
- https://www.wikidata.org/wiki/Q30642
- Data Set
- https://www.wikidata.org/wiki/Q1172284
- unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata
- https://www.wikidata.org/wiki/Q106864121