Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building
Description
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzip-compressed JSON Lines files, where each line is a JSON object representing a Zenodo record or community.
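As a minimal sketch of how such a file could be read without decompressing it to disk first, the helper below streams a gzip-compressed JSON Lines file one object at a time. The function name `iter_jsonl_gz` is hypothetical and not part of the dataset; the path you pass in is whatever the downloaded file is called locally.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip-compressed JSON Lines file."""
    # "rt" opens the gzip stream in text mode so lines can be decoded as UTF-8.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Because the function is a generator, only one line is held in memory at a time, which matters for the larger of the two datasets.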
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
- The term files contains a list of dictionaries containing filetype, size, and filename only.
- The term license contains a short Zenodo ID of the license (e.g. "cc-by").
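The altered terms above could be consumed along the following lines. This is only a sketch: `summarize_files` is a hypothetical helper, and it assumes that `size` is a number (or null) and that `files` itself may be null or absent for some records.

```python
from collections import Counter


def summarize_files(record: dict) -> dict:
    """Summarize the `files` and `license` terms of one record object."""
    # Assumption: `files` may be missing or null, so fall back to an empty list.
    files = record.get("files") or []
    return {
        "license": record.get("license"),  # short Zenodo license ID, e.g. "cc-by"
        "n_files": len(files),
        # Assumption: `size` is numeric; null sizes are counted as 0.
        "total_size": sum(f.get("size") or 0 for f in files),
        "filetypes": dict(Counter(f.get("filetype") for f in files)),
    }
```

For a record with two `pdf` entries and one `csv` entry, `filetypes` would come back as a mapping from each file type to its count.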
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object, the term spam contains a boolean value indicating whether the given record or community was marked as spam by Zenodo staff.
Top-level terms that were missing from the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
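The notes above suggest one way to turn the objects into labelled examples for a classifier. The sketch below is an assumption about usage, not the method used by Zenodo: it joins `title` and `description` (terms present in both datasets) into one text, treats null values as empty strings, and maps the `spam` boolean to 0 or 1. The names `to_training_example` and `load_examples` are hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def to_training_example(obj: dict) -> Tuple[str, int]:
    """Map one record or community object to a (text, label) pair."""
    # Missing top-level terms may be null, so fall back to an empty string.
    title = obj.get("title") or ""
    description = obj.get("description") or ""
    text = f"{title}\n{description}".strip()
    label = 1 if obj.get("spam") else 0  # 1 = marked as spam by Zenodo staff
    return text, label


def load_examples(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse JSON lines (for instance from the uncompressed sample) into examples."""
    return [to_training_example(json.loads(line)) for line in lines if line.strip()]
```

Since the sample files are uncompressed, an open text file handle can be passed directly as `lines`.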
Files
Total size: 3.5 GB
| MD5 checksum | Size |
|---|---|
| 78feabab3fa987d8117cba4d969489f4 | 339.5 MB |
| 8d309fb09fc8b0cf0e8479393144c6b9 | 11.1 MB |
| a5c6656074d2967611a1b31c1a3ad3ea | 3.1 GB |
| 6d8f988a4f77dd05e99f0935483f6b02 | 3.3 MB |
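A downloaded file can be checked against its listed MD5 checksum with a few lines of standard-library Python. This is a sketch under assumptions: the helper names `md5_of_file` and `matches_checksum` are hypothetical, and which checksum belongs to which file has to be taken from the download page itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so multi-gigabyte files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with a value such as 'md5:<hex digest>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A mismatch usually means an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.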