Published December 14, 2022 | Version v6
Dataset Open

Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building

Creators

  • 1. CERN

Description

This dataset contains the metadata of Zenodo's published open access records and communities, including entries that Zenodo staff marked as spam and deleted.

The datasets are gzip-compressed JSON Lines files, where each line is a JSON object representing a Zenodo record or community.
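Because each line is an independent JSON object, the files can be streamed without loading them fully into memory. The following is a minimal sketch using only the Python standard library; the function name iter_jsonl_gz is an illustrative choice, not part of the dataset, and UTF-8 encoding is assumed.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line of a gzip-compressed JSON Lines file.

    Assumption: the file is UTF-8 encoded text once decompressed.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Iterating lazily keeps memory use roughly constant, which matters for the larger export.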

Records dataset

Filename: zenodo_open_metadata_{ date of export }.jsonl.gz

Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

In addition, some terms have been altered:

  • The term files contains a list of dictionaries, each holding only filetype, size, and filename.
  • The term license contains a short Zenodo ID of the license (e.g. "cc-by").
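To illustrate the altered terms, here is a small sketch that summarizes the files and license terms of one record object. The helper name summarize_files is hypothetical, and it assumes that size is a number that can be summed (the unit is not stated in this description) and that either term may be null.

```python
from typing import Any, Dict, List


def summarize_files(record: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the reduced `files` and `license` terms of a record object.

    Assumptions: `size` is numeric, and `files` or `license` may be null.
    """
    files: List[Dict[str, Any]] = record.get("files") or []
    return {
        "license": record.get("license"),  # short Zenodo license ID, e.g. "cc-by"
        "n_files": len(files),
        "total_size": sum(f.get("size") or 0 for f in files),
        "filetypes": sorted({f["filetype"] for f in files if f.get("filetype")}),
    }


# Made-up record for illustration only; not taken from the dataset.
example = {
    "license": "cc-by",
    "files": [
        {"filetype": "pdf", "size": 100, "filename": "a.pdf"},
        {"filetype": "csv", "size": 50, "filename": "b.csv"},
    ],
}
summary = summarize_files(example)
```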

Communities dataset

Filename: zenodo_community_metadata_{ date of export }.jsonl.gz

Each object contains the terms: id, title, description, curation_policy, page 

which correspond to the fields with the same name available in Zenodo's community creation form.

Notes for all datasets

For each object, the term spam contains a boolean value indicating whether the record or community was marked as spam by Zenodo staff.

Top-level terms that were missing in the original metadata may contain a null value.
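When preparing training examples, the spam label and possible null values both need handling. The sketch below shows one way to turn an object into a (text, label) pair; the helper name to_example and the choice of title and description as text terms are assumptions made for illustration, not a prescribed feature set.

```python
from typing import Any, Dict, Sequence, Tuple


def to_example(
    obj: Dict[str, Any],
    text_terms: Sequence[str] = ("title", "description"),
) -> Tuple[str, bool]:
    """Build a (text, is_spam) pair from a record or community object.

    Terms that are null, absent, or not strings are skipped rather than
    raising, since missing metadata may appear as null.
    """
    parts = [obj[term] for term in text_terms if isinstance(obj.get(term), str)]
    return " ".join(parts), bool(obj.get("spam"))
```

Skipping non-string values keeps the sketch robust to null terms; a real pipeline may prefer to track missingness as a feature in its own right.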

For each dataset, an uncompressed random sample of 200 JSON lines is also included, so the format can be explored without downloading the entire dataset.

Files

Files (3.5 GB)

Checksum                                Size
md5:78feabab3fa987d8117cba4d969489f4    339.5 MB
md5:8d309fb09fc8b0cf0e8479393144c6b9    11.1 MB
md5:a5c6656074d2967611a1b31c1a3ad3ea    3.1 GB
md5:6d8f988a4f77dd05e99f0935483f6b02    3.3 MB