ArXiv OAI-PMH arXivRaw publication metadata
Description
This dataset contains OAI-PMH metadata for all ArXiv publications up until 2024-04-23 in the arXivRaw XML format.
The metadata was harvested with the metha Go package v0.3.3 [1] on go1.18. Harvesting ran on a small HPC cluster using the following SLURM script. The script had to be scheduled twice because the connection was reset by the peer (see combined-slurm.out). metha is designed for such interruptions: it harvests incrementally, so a later run can pick up where the previous one left off.
```bash
#!/bin/bash
#SBATCH --job-name=metha
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --time=10-20:00:00

# Start from a clean module environment, then load the Go toolchain.
module purge
echo "Installing Go module."
module add go/go-1.18/go-1.18-gcc-9.4.0-okbjyoy
echo "Installed Go module: $(go version)."

# Install the metha command-line tools (v0.3.3 was used for this harvest).
echo "Installing metha."
go install -v github.com/miku/metha/cmd/...@latest
echo "Installed metha: $(<retracted>/go/bin/metha-sync -v)"

# First run: harvest all records in the arXivRaw format.
echo "Harvesting ArXiv OAI-PMH metadata in format 'arXivRaw' from http://export.arxiv.org/oai2."
<retracted>/go/bin/metha-sync -T 5m -base-dir /scratch/<retracted>/arxiv -format "arXivRaw" http://export.arxiv.org/oai2

# For the second run, '-from' was specified to pick up the harvest where it was left off.
# <retracted>/go/bin/metha-sync -from 2020-09-29 -T 5m -base-dir /scratch/<retracted>/arxiv -format "arXivRaw" http://export.arxiv.org/oai2

echo "Done."
exit 0
```
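When a harvest is interrupted, a date for '-from' has to be chosen for the next run. The following is a minimal sketch of one way to find the most recent date among files that have already been written. It is not how the date above was determined. The function name `latest_harvest_date` is hypothetical, and the sketch assumes that the given directory directly contains files named YYYY-MM-DD-NNNNNNNN.xml.gz.

```bash
# Hypothetical helper (not part of metha): print the most recent date prefix
# found among files named YYYY-MM-DD-NNNNNNNN.xml.gz in the given directory.
# Returns 1 if no matching file exists.
latest_harvest_date() {
    latest="$(
        for f in "$1"/*.xml.gz; do
            name="${f##*/}"
            # Skip anything that does not follow the expected naming pattern
            # (this also covers the case where the glob matched nothing).
            case "$name" in
                [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].xml.gz) ;;
                *) continue ;;
            esac
            # Drop the trailing "-NNNNNNNN.xml.gz" to keep only the date.
            printf '%s\n' "${name%-*}"
        done | sort | tail -n 1   # ISO dates sort correctly as plain strings
    )"
    [ -n "$latest" ] || return 1
    printf '%s\n' "$latest"
}
```

The printed date could then be passed to '-from'. Whether that date is the right resume point for a given harvest should be checked against the SLURM log first.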
Dataset contents
This deposit of the dataset contains the following files:
- metha-output-OAI-PMH-arXivRaw-until-2024-03-24.tar.gz: A tar archive containing the gzipped files (*.xml.gz) produced by metha, which in turn contain the XML metadata. The gzipped files in the archive are named following the pattern YYYY-MM-DD-<8-digit zero-padded file count, starting at 0>.xml.gz, e.g., 2024-03-24-00000001.xml.gz.
- README.md: This file, containing basic information about the dataset and deposit.
- combined-slurm.out: The combined SLURM log for the two consecutive SLURM runs that produced the dataset. Run-specific information has been redacted.
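The file naming pattern can be taken apart with plain shell parameter expansion. The following is a minimal sketch. The function name `parse_metha_name` is hypothetical and not part of metha.

```bash
# Hypothetical helper: split a file name of the form
# YYYY-MM-DD-NNNNNNNN.xml.gz into its date and its zero-padded file count.
# Prints "DATE COUNT" on success, returns 1 for any other name.
parse_metha_name() {
    name="${1##*/}"                 # strip any leading directories
    case "$name" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].xml.gz) ;;
        *) echo "unexpected file name: $name" >&2; return 1 ;;
    esac
    stem="${name%.xml.gz}"          # e.g. 2024-03-24-00000001
    printf '%s %s\n' "${stem%-*}" "${stem##*-}"
}
```

For example, `parse_metha_name 2024-03-24-00000001.xml.gz` prints `2024-03-24 00000001`. After unpacking the archive with `tar -xzf`, the individual files can be located with `find . -name '*.xml.gz'`, since the directory layout inside the archive is not described here.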
Reproducibility
As the OAI-PMH metadata is not static but may change at any time, this dataset is not fully reproducible. Running the same metha version on the same Go version with the same commands should yield very similar results, but the output will also contain newer metadata.
Licenses
- All ArXiv OAI-PMH metadata is licensed under CC0-1.0.
- combined-slurm.out is licensed under CC0-1.0.
- README.md is licensed under CC0-1.0.
Licenses are documented in a machine-readable manner following the REUSE 3.0 Specification. License deeds are included in this deposit as .txt files named using the respective SPDX license identifiers.
[1] Martin Czygan, Thomas Gersch, ACz-UniBi, Justin Kelly, Gunnar Þór Magnússon, dvglc, & Natanael Arndt. (2024). miku/metha: v0.3.3 (v0.3.3). Zenodo. doi:10.5281/zenodo.10940212.
Files
Total size: 1.4 GB
| Checksum | Size |
|---|---|
| md5:65d3616852dbf7b1a6d4b53b00626032 | 7.0 kB |
| md5:23c727eefd7c3e68871387fd30ea168d | 645.4 kB |
| md5:8fea46822fe082d1e973af805461c82a | 97 Bytes |
| md5:2aa23fc74ba78c255c75e986a2406394 | 1.4 GB |
| md5:8fea46822fe082d1e973af805461c82a | 97 Bytes |
| md5:fb9033ee489e34bb43acf46c835bf342 | 3.3 kB |
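The MD5 checksums listed above can be used to check that a download is intact. The following is a minimal sketch, assuming the GNU coreutils `md5sum` command is available. The function name `verify_md5` is hypothetical.

```bash
# Hypothetical helper: compare a file's MD5 checksum with an expected value.
# Usage: verify_md5 <file> <expected-md5-hex>
verify_md5() {
    file="$1"
    expected="$2"
    # md5sum prints "<hash>  <file name>"; keep only the hash.
    actual="$(md5sum "$file" | cut -d ' ' -f 1)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}
```

The expected value is the hexadecimal string after the `md5:` prefix, e.g. `verify_md5 path/to/downloaded-file 2aa23fc74ba78c255c75e986a2406394`, where the path is a placeholder for wherever the file was saved.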
Additional details
Dates
- Submitted: 2024-04-25