Published April 25, 2024 | Version 1
Dataset Open

ArXiv OAI-PMH arXivRaw publication metadata

  • 1. German Aerospace Center

Description

This dataset contains OAI-PMH metadata for all ArXiv publications up until 2024-04-23 in the arXivRaw XML format.

The metadata was harvested using the metha Go package v0.3.3 [1] on go1.18. Harvesting ran on a small HPC cluster and had to be scheduled twice because the connection was reset by the peer (see combined-slurm.out). metha handles such interruptions and can pick up where it left off with cumulative harvesting. The following SLURM script was used:

#!/bin/bash
#SBATCH --job-name=metha
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --time=10-20:00:00

module purge

echo "Installing Go module."
module add go/go-1.18/go-1.18-gcc-9.4.0-okbjyoy
echo "Installed Go module: $(go version)."

echo "Installing metha."
go install -v github.com/miku/metha/cmd/...@latest
echo "Installed metha: $(<retracted>/go/bin/metha-sync -v)"

echo "Harvesting ArXiv OAI-PMH metadata in format 'arXivRaw' from http://export.arxiv.org/oai2."
<retracted>/go/bin/metha-sync -T 5m -base-dir /scratch/<retracted>/arxiv -format "arXivRaw" http://export.arxiv.org/oai2
# For the second run, '-from' was specified to pick up the harvest where it was left off.
# <retracted>/go/bin/metha-sync -from 2020-09-29 -T 5m -base-dir /scratch/<retracted>/arxiv -format "arXivRaw" http://export.arxiv.org/oai2
echo "Done."
exit 0

Dataset contents

This deposit of the dataset contains the following files:

  • metha-output-OAI-PMH-arXivRaw-until-2024-03-24.tar.gz: a tar archive containing the gzipped files (*.xml.gz) produced by metha, which in turn contain the XML metadata. The gzipped files in the archive are named following the pattern YYYY-MM-DD-<8-digit zero-padded 0-index file count>.xml.gz, e.g., 2024-03-24-00000001.xml.gz.
  • README.md: This file, containing basic information about the dataset and deposit.
  • combined-slurm.out: The combined SLURM log for the two consecutive SLURM runs that have produced the dataset. Run-specific information has been retracted.
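
The naming pattern described above lends itself to a quick sanity check after unpacking. The following is a minimal sketch in POSIX shell and was not part of the original harvest. The helper names is_metha_file, metha_file_date, and dump_metha_xml are hypothetical, and the directory layout inside the archive is an assumption, which is why the sketch searches recursively.

```shell
#!/bin/sh

# Hypothetical helper: succeed if the file name follows the documented
# pattern YYYY-MM-DD-<8 digits>.xml.gz.
is_metha_file() {
    printf '%s\n' "$(basename "$1")" |
        grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{8}\.xml\.gz$'
}

# Hypothetical helper: print the date prefix encoded in a file name.
metha_file_date() {
    is_metha_file "$1" || return 1
    basename "$1" | cut -c1-10
}

# Hypothetical helper: unpack the outer tar archive into a directory and
# write the decompressed XML of every matching file to stdout.
dump_metha_xml() {
    archive=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    tar -xzf "$archive" -C "$outdir" || return 1
    # The archive layout is assumed unknown, so search recursively.
    find "$outdir" -type f -name '*.xml.gz' | sort |
        while IFS= read -r f; do
            if is_metha_file "$f"; then
                gzip -dc "$f"
            fi
        done
}
```

For example, `dump_metha_xml metha-output-OAI-PMH-arXivRaw-until-2024-03-24.tar.gz ./arxiv-raw | head` would print the beginning of the first XML file, assuming the archive sits in the current directory.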

Reproducibility

Because the OAI-PMH metadata is not static and may change at any time, this dataset is not fully reproducible. Running the same metha version on the same Go version with the same commands should yield very similar results, although the output will also contain newer metadata.

Licenses

Licenses are documented in a machine-readable manner following the REUSE 3.0 Specification. License deeds are included in this deposit as .txt files named using the respective SPDX license identifiers.

[1] Martin Czygan, Thomas Gersch, ACz-UniBi, Justin Kelly, Gunnar Þór Magnússon, dvglc, & Natanael Arndt. (2024). miku/metha: v0.3.3 (v0.3.3). Zenodo. doi:10.5281/zenodo.10940212.

Files (1.4 GB)

  • md5:65d3616852dbf7b1a6d4b53b00626032 (7.0 kB)
  • md5:23c727eefd7c3e68871387fd30ea168d (645.4 kB)
  • md5:8fea46822fe082d1e973af805461c82a (97 Bytes)
  • md5:2aa23fc74ba78c255c75e986a2406394 (1.4 GB)
  • md5:8fea46822fe082d1e973af805461c82a (97 Bytes)
  • md5:fb9033ee489e34bb43acf46c835bf342 (3.3 kB)
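
The MD5 checksums listed for the deposited files can be used to check a download for corruption. The following is a minimal sketch, assuming GNU md5sum is available. The helper name verify_md5 is hypothetical, and because the listing here does not pair checksums with file names, the expected value has to be taken from the deposit's file table.

```shell
#!/bin/sh

# Hypothetical helper: compare a file's MD5 digest with an expected value.
# Usage: verify_md5 <file> <expected-md5-hex>
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <file>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}
```

A call such as `verify_md5 <downloaded file> <checksum> && echo "checksum OK"` prints the message only if the digests match.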

Additional details

Dates

Submitted
2024-04-25
