unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network (open subset)

Saier, Tarek; Krause, Johan; Färber, Michael

doi:10.5281/zenodo.7752615

Published March 27, 2023 | Version v1

Dataset Open

1. Karlsruhe Institute of Technology

Description

unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network.

The data is generated from all LaTeX sources on arXiv and is therefore of higher quality than data generated from PDF files.

Typical uses are:

- Training of ML models (citation recommendation, summarization, LLMs)
- Citation context analysis
- Bibliographic analyses

Access

Download sample

Regarding the full data set, please note the following:

Note: this Zenodo record is the "open subset" of unarXive, which contains all permissively licensed papers from arXiv.org. You can find the full version here.

The code used for generating the data set is publicly available.

Files (4.8 GB)

Name	MD5	Size
LICENSE	b296faaab9c17d5874fd17967df54736	20.1 kB
README	711e5552a59c7820fccf85d53e43812e	6.3 kB
unarXive_230324_open_subset.tar.xz	4e9af00b730f1b8a680e4d7a47465395	4.8 GB
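
Since the listing above gives an MD5 checksum for each file, a download can be checked against it before the 4.8 GB archive is unpacked. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `verify_downloads` are made up for illustration, and the script assumes the files were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "LICENSE": "b296faaab9c17d5874fd17967df54736",
    "README": "711e5552a59c7820fccf85d53e43812e",
    "unarXive_230324_open_subset.tar.xz": "4e9af00b730f1b8a680e4d7a47465395",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each listed file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    # Assumption: the files were saved to the current working directory.
    for name, ok in verify_downloads(".").items():
        status = {True: "OK", False: "MISMATCH", None: "missing"}[ok]
        print(f"{name}: {status}")
```

Reading in chunks keeps memory use flat regardless of archive size. A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.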

Additional details

Is described by: Conference paper: 10.1109/JCDL57899.2023.00020 (DOI)
Is part of: Dataset: 10.5281/zenodo.7752754 (DOI)

Natural Language Processing: https://www.wikidata.org/wiki/Q30642
Data Set: https://www.wikidata.org/wiki/Q1172284
unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata: https://www.wikidata.org/wiki/Q106864121

	All versions	This version
Views	1,651	1,641
Downloads	1,071	1,067
Data volume	4.4 TB	4.3 TB
