Published January 26, 2026 | Version v1
Dataset Open

Text mining and analysis of "data available upon request" statements in scientific articles. (XMLs)

  • 1. Inserm
  • 2. Theories and Approaches of Genomic Complexity
  • 3. Aix-Marseille University

Description

This Zenodo record provides the full-text JATS XML files (from the PMC Open Access Subset) used to reproduce the text-mining analyses in the associated manuscript (covering 2010–2025), which focus on “data available upon request” style statements in the genomics, genetics and bioinformatics literature.

This record accompanies the corresponding GitHub repository and the archived code release on Zenodo (Code). The derived tables and metadata used by the pipeline are distributed in a separate Zenodo record (Data).

Overview

The analysis pipeline detects “upon request” sentences in data availability statements, classifies them (vague versus linked to explicit access mechanisms or legitimate restrictions), and extracts additional open science indicators. This record provides the underlying full-text XML inputs required to reproduce the parsing and classification steps.
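As a rough illustration of the detection step described above, the following sketch flags sentences that contain a request-style phrase. It is not the pipeline's actual implementation: the regular expression, the naive sentence splitter, and the names `REQUEST_RE` and `find_request_sentences` are assumptions made for this example only.

```python
import re

# Hypothetical pattern (an assumption, not the pipeline's rule set): matches
# phrases such as "available upon request" or "available from the authors
# on reasonable request" within a single sentence.
REQUEST_RE = re.compile(
    r"\bavailable\b[^.]{0,80}?\b(?:up)?on\s+(?:reasonable\s+)?request\b",
    re.IGNORECASE,
)


def find_request_sentences(text: str) -> list[str]:
    """Return the sentences in `text` that contain a request-style phrase."""
    # Naive split on ., ! or ? followed by whitespace; real articles would
    # need a more robust sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if REQUEST_RE.search(s)]


example = (
    "Sequencing data were deposited in a public repository. "
    "Other data are available from the corresponding author upon reasonable request."
)
print(find_request_sentences(example))
```

A pattern this simple only finds candidate sentences. Separating vague statements from those tied to an explicit access mechanism or a legitimate restriction requires the additional classification logic provided in the Code record.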

Files in this record

- 3.xml.zip  
  Archive containing shareable full-text JATS XML files used as input for the pipeline. This archive includes only XML files with an article-level license that permits redistribution (machine-readable Creative Commons licenses). Items with no license, a custom license, or other unclear redistribution terms are not included.

- xml_licenses.tsv  
  Manifest listing each redistributed XML file (PMCID) together with its license (as captured from the XML metadata).

- README_licence.txt  
  Short note explaining how licenses apply to the archived XML files.

Non-redistributable items were located in the folder 3.no_cc_code/ and are not redistributed in this record.
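To get an overview of the licenses in the archive, the manifest can be summarised with a few lines of Python. This is a minimal sketch that assumes xml_licenses.tsv is tab-separated with a header row. The column names `pmcid` and `license` and the sample rows are hypothetical, so check the actual header before use.

```python
import csv
import io
from collections import Counter


def count_licenses(handle, license_column="license"):
    """Count rows per license in a tab-separated manifest with a header row.

    `license_column` is an assumed column name; adjust it to the real header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[license_column] for row in reader)


# Made-up rows standing in for the real manifest.
sample = io.StringIO(
    "pmcid\tlicense\n"
    "PMC0000001\tCC BY\n"
    "PMC0000002\tCC BY-NC\n"
    "PMC0000003\tCC BY\n"
)
counts = count_licenses(sample)
print(counts.most_common())
```

With the real file, the same function can be called on `open("xml_licenses.tsv", newline="", encoding="utf-8")`.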

Notes on licensing and redistribution (important)

The JATS XML files in 3.xml.zip are third-party content redistributed from the PubMed Central Open Access Subset. Each XML file is included only when its article-level license permits redistribution.

Reuse of each XML file is governed by its original license terms (as stated in the XML and, when applicable, on the article landing page). Licenses vary across files and include multiple Creative Commons variants (for example CC BY, CC BY-NC, CC BY-NC-ND, CC0). Non-commercial and no-derivatives restrictions apply where indicated.

The license selected for this Zenodo record does not modify or override the license terms attached to any included XML file. For transparency, xml_licenses.tsv provides the per-file license list for the entire archive.
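Because reuse is governed by the license stated in each XML file, it can be useful to read that license directly from the file. The sketch below assumes the license is recorded as an `xlink:href` attribute on a JATS `<license>` element, which is a common but not universal pattern. The sample document is invented for illustration, and `license_urls` is a hypothetical helper, not part of the pipeline.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xlink:href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def license_urls(jats_xml: str) -> list[str]:
    """Return the license URLs found on <license> elements of a JATS document."""
    root = ET.fromstring(jats_xml)
    # Assumes un-namespaced JATS elements; files that record the license
    # elsewhere (for example only in free text) yield an empty list.
    return [el.get(XLINK_HREF) for el in root.iter("license") if el.get(XLINK_HREF)]


# Invented minimal document, not taken from the archive.
sample = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta><permissions>
    <license license-type="open-access"
             xlink:href="https://creativecommons.org/licenses/by/4.0/">
      <license-p>Illustrative license text.</license-p>
    </license>
  </permissions></article-meta></front>
</article>"""
print(license_urls(sample))
```

An empty result does not mean that a file is unlicensed. In that case, consult xml_licenses.tsv and the article landing page.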

Citation

If you use these XML files in academic work, please cite the associated manuscript and the Zenodo records below (code, XML corpus, and derived tables).

The full-text JATS XML files in this record were obtained from the PubMed Central Open Access Subset and retrieved via the PMC OAI-PMH service. License terms vary by article; users are solely responsible for compliance with copyright restrictions and the terms defined by the copyright holder.

Please cite the source as: PMC Open Access Subset. Bethesda (MD): National Library of Medicine. 2003 - [cited YEAR MONTH DAY]. Available from https://pmc.ncbi.nlm.nih.gov/tools/openftlist/

Manuscript
Ballester, B. (2026). *From ‘data available upon request’ to accountable data access in genomics*. DOI: to be added.

Code (Zenodo):
Ballester, B. (2026). *Code: Text mining and analysis of “data available upon request” statements in scientific articles* (v1). Zenodo. https://doi.org/10.5281/zenodo.18339878

Full-text XML corpus (this record):
Ballester, B. (2026). *XML: PubMed Central Open Access Subset JATS XML files used for “data available upon request” analyses* (2010–2025). Zenodo. https://doi.org/10.5281/zenodo.18377386

Derived tables (2.data record):
Ballester, B. (2026). *Data: Text mining and analysis of “data available upon request” statements in scientific articles*. Zenodo. https://doi.org/10.5281/zenodo.18375259

Contact

Benoît Ballester (Aix Marseille Univ, INSERM, TAGC, UMR 1090, Marseille, France)

Files (3.8 GB)

- md5:7fa5cc44a851d3b61e040061f321b1b3 (1.0 GB)
- md5:a3549518390f2a21b0223f21f1ee01e6 (1.0 GB)
- md5:e642e7409c7db311cad2bc6bc3a338aa (1.0 GB)
- md5:1ebd1ea28a31c7619e08160337874948 (674.8 MB)
- md5:97a4cd118e2b06361fd255dd7b4559f6 (3.0 kB)
- md5:267f59c8926eeb62da3cc764beb6f87a (7.5 MB)
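The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are introduced here for illustration, and the demo runs on a small temporary file rather than on the real archives.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = expected.removeprefix("md5:").lower()
    return md5_of_file(path) == expected


# Demo on a throwaway file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_ok = matches_checksum(tmp.name, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(tmp.name)
print(demo_ok)
```

For a real check, pass the path of a downloaded file together with the checksum shown next to that file in the record's file list. Reading in chunks keeps memory use low for the gigabyte-sized archives.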