Dataset Open Access

# Zenodo metadata JSON records as of 2019-09-16

Stian Soiland-Reyes; Paul Groth

This preliminary dataset contains the application/vnd.zenodo.v1+json JSON records of Zenodo deposits as retrieved on 2019-09-16.

Files

• zenodo-records-json-2019-09-16.tar.xz Zenodo JSON records
XZ-compressed tar archive of individual JSON records as retrieved from Zenodo. Filenames reflect the record ID, e.g. 1310621.json was retrieved from https://zenodo.org/api/records/1310621 using content negotiation for application/vnd.zenodo.v1+json.
• zenodo-records-json-2019-09-16-filtered.jsonseq.xz Concatenated Zenodo JSON records
XZ-compressed RFC7464 JSON Sequence stream, readable by jq. Concatenation of the Zenodo JSON records. Order is not significant.
• zenodo-records.sh Retrieve Zenodo JSON records
A retrospectively created Bash shell script that shows the commands used to retrieve the JSON files and concatenate them to jsonseq.
• ro-crate-metadata.jsonld RO-Crate 0.2 structured metadata
• ro-crate-preview.html Browser rendering of RO-Crate structured metadata
• README.md This dataset description
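
As an illustration of how the JSON Sequence file listed above might be consumed without jq, here is a minimal Python sketch. It assumes each record sits on its own line, prefixed by the ASCII Record Separator (0x1e), as the description of the retrieval script suggests. The function names `iter_records` and `read_records` are hypothetical and not part of the dataset.

```python
import json
import lzma

RS = "\x1e"  # ASCII Record Separator used by RFC 7464 JSON text sequences


def iter_records(lines):
    """Yield one parsed JSON record per non-empty line of a JSON sequence."""
    for line in lines:
        # Drop the leading RS and the trailing newline around each JSON text.
        text = line.strip(RS + " \t\r\n")
        if text:
            yield json.loads(text)


def read_records(path):
    """Stream records from an XZ-compressed JSON sequence file (assumed layout)."""
    with lzma.open(path, "rt", encoding="utf-8", newline="\n") as stream:
        yield from iter_records(stream)


# Small in-memory example; a real run would call
# read_records("zenodo-records-json-2019-09-16-filtered.jsonseq.xz").
sample = ['\x1e{"id": 1, "conceptrecid": "0"}\n', '\x1e{"id": 2, "conceptrecid": "1"}\n']
print([record["id"] for record in iter_records(sample)])
```

Streaming line by line keeps memory use low, which matters because the decompressed stream is much larger than the 521.4 MB compressed file.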

Copyright 2019 The University of Manchester

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0


Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

The Zenodo metadata in zenodo-records-json-2019-09-16.tar.xz is reused under the terms of https://creativecommons.org/publicdomain/zero/1.0/

Reproducibility

Retrieving the Zenodo JSON required using undocumented parts of the Zenodo API.

The Zenodo source code shows that the REST template https://zenodo.org/api/records/{pid_value} can be used with pid_value set to the numeric part of the OAI-PMH identifier. For example, for oai:zenodo.org:1310621 the Zenodo JSON can be retrieved at https://zenodo.org/api/records/1310621.
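
The mapping from OAI-PMH identifier to REST URL can be sketched in a few lines of Python. This is only an illustration of the template described above; the helper name `record_url` and the exact identifier pattern are assumptions, not taken from the Zenodo source code or from zenodo-records.sh.

```python
import re

API_TEMPLATE = "https://zenodo.org/api/records/{pid_value}"


def record_url(oai_identifier: str) -> str:
    """Map an identifier like oai:zenodo.org:1310621 to its REST API URL."""
    # Assumed identifier shape: the prefix "oai:zenodo.org:" followed by digits.
    match = re.fullmatch(r"oai:zenodo\.org:(\d+)", oai_identifier)
    if match is None:
        raise ValueError(f"not a Zenodo OAI-PMH identifier: {oai_identifier!r}")
    return API_TEMPLATE.format(pid_value=match.group(1))


print(record_url("oai:zenodo.org:1310621"))
# prints https://zenodo.org/api/records/1310621
```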

The JSON API supports content negotiation. The content types supported as of 2019-09-20 include:

• application/vnd.zenodo.v1+json giving the Zenodo record in Zenodo's internal JSON schema (v1)
• application/ld+json giving JSON-LD Linked Data using the http://schema.org/ vocabulary
• application/x-datacite-v41+xml giving DataCite v4 XML
• application/marcxml+xml giving MARC 21 XML
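
Content negotiation here means sending the desired media type in the HTTP Accept header. The following Python sketch shows one way to build such a request with the standard library. The helper names `build_request` and `fetch` are hypothetical, and the API behaviour described is that of 2019, so a current server may respond differently.

```python
import urllib.request

ZENODO_JSON = "application/vnd.zenodo.v1+json"


def build_request(record_id: int, media_type: str = ZENODO_JSON) -> urllib.request.Request:
    """Prepare a GET request that asks for one representation of a record."""
    url = f"https://zenodo.org/api/records/{record_id}"
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch(record_id: int, media_type: str = ZENODO_JSON) -> bytes:
    """Perform the request and return the raw response body (needs network access)."""
    with urllib.request.urlopen(build_request(record_id, media_type), timeout=30) as response:
        return response.read()


# Building the request does not touch the network.
request = build_request(1310621, "application/ld+json")
print(request.full_url, request.get_header("Accept"))
```

Swapping the `media_type` argument for any of the content types listed above selects the corresponding representation of the same record.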

Using these (currently) undocumented parts of the Zenodo API thus avoids the need for HTML scraping while also giving individual complete records that are suitable to redistribute as records in a filtered dataset.

This preliminary exploration will be adapted into a reproducible CWL workflow; for now it is included as the Bash script zenodo-records.sh.

Execution took about 3 days on a server in the University of Manchester network with a single 1 GBps network link. The script performs the following steps:

• Retrieve each of the first 3.5 million Zenodo records as Zenodo JSON by iterating over possible numeric IDs (the maximum ID 3450000 was estimated from "Recent uploads")
• Filter the list to exclude records that are not found, moved or deleted. The presence of the key conceptrecid is used as the marker of a record to keep.
• Use jq to ensure the JSON is on a single line
• Join the JSON files using the ASCII Record Separator (RS, 0x1e) to make an application/json-seq JSON text sequence stream
• Save the JSON stream as a single compressed file using xz
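
The filtering, joining and compression steps above can be sketched in Python as follows. This is not the original script, which uses Bash and jq; it is a re-expression under the assumption that each retained record is written as RS, the single-line JSON text, then a newline, as RFC 7464 prescribes. The names `keep_record`, `to_json_seq` and `compress_records` are hypothetical.

```python
import json
import lzma

RS = b"\x1e"  # ASCII Record Separator


def keep_record(record: dict) -> bool:
    """Retain only responses that carry the conceptrecid key."""
    return "conceptrecid" in record


def to_json_seq(records) -> bytes:
    """Serialise retained records as an application/json-seq byte stream."""
    chunks = []
    for record in records:
        if keep_record(record):
            # json.dumps without indent always yields a single line.
            line = json.dumps(record, separators=(",", ":"))
            chunks.append(RS + line.encode("utf-8") + b"\n")
    return b"".join(chunks)


def compress_records(records) -> bytes:
    """Compress the JSON sequence into the XZ container format."""
    return lzma.compress(to_json_seq(records))


responses = [
    {"conceptrecid": "1", "id": 2},
    {"status": 404, "message": "not found"},  # made-up error body, dropped by the filter
]
print(to_json_seq(responses))
```

For the full dataset the stream would be written incrementally to a file rather than built in memory; the in-memory form is used here only to keep the sketch short.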

Files (1.2 GB)

Name, size, MD5 checksum:

• 4.7 kB, md5:f671dfcfe957d6fd4f9dd7cbd87909d9
• 8.0 kB, md5:84df3d1a1a8cc1b1a13ff6674f1eac3b
• ro-crate-preview.html, 22.6 kB, md5:2c37a90c40c0acd9a4300e3e9136064d
• zenodo-records-json-2019-09-16-filtered.jsonseq.xz, 521.4 MB, md5:5d492275b05a7985bb0c1c2d8def1992
• zenodo-records-json-2019-09-16.tar.xz, 664.4 MB, md5:91a6c914dc4b1f33a8ccb59aeb2a905b
• zenodo-records.sh, 1.8 kB, md5:d5c32b8826a3370c064a8e5881ed6455