Text mining and analysis of "data available upon request" statements in scientific articles. (XMLs)
Description
This Zenodo record provides the full-text JATS XML files (from the PMC Open Access Subset) used to reproduce the text-mining analyses in the associated manuscript (2010–2025), which focuses on “data available upon request” style statements in the genomics, genetics and bioinformatics literature.
This record accompanies the corresponding GitHub repository and the archived code release on Zenodo (Code). The derived tables and metadata used by the pipeline are distributed in a separate Zenodo record (Data).
Overview
The analysis pipeline detects “upon request” sentences in data availability statements, classifies them (vague versus linked to explicit access mechanisms or legitimate restrictions), and extracts additional open science indicators. This record provides the underlying full-text XML inputs required to reproduce the parsing and classification steps.
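As a rough illustration of the detection step, the sketch below scans a JATS document for sentences that mention data being available "upon request". It is not the pipeline's actual implementation: the regular expression, the function name `find_upon_request_sentences`, and the assumption that availability statements sit in elements whose `sec-type` or `notes-type` attribute contains `data-availability` are all hypothetical choices made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: the rules used by the actual pipeline may differ.
UPON_REQUEST = re.compile(
    r"\b(?:up)?on\s+(?:reasonable\s+)?request\b", re.IGNORECASE
)


def find_upon_request_sentences(jats_xml: str) -> list:
    """Return sentences from data availability sections that match UPON_REQUEST."""
    root = ET.fromstring(jats_xml)
    hits = []
    for elem in root.iter():
        # Assumption: availability statements are tagged through the
        # sec-type or notes-type attribute.
        kind = elem.get("sec-type") or elem.get("notes-type") or ""
        if "data-availability" not in kind:
            continue
        for paragraph in elem.iter("p"):
            # Collapse whitespace, then split naively on sentence punctuation.
            text = " ".join("".join(paragraph.itertext()).split())
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                if UPON_REQUEST.search(sentence):
                    hits.append(sentence)
    return hits


SAMPLE = """<article><back><sec sec-type="data-availability">
<title>Data availability</title>
<p>The datasets are available from the corresponding author upon reasonable request. Source code is hosted on GitHub.</p>
</sec></back></article>"""

print(find_upon_request_sentences(SAMPLE))
```

Classifying a matched sentence as vague or as linked to an explicit access mechanism would need further rules that are not sketched here.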
Files in this record
- 3.xml.zip
Archive containing shareable full-text JATS XML files used as input for the pipeline. This archive includes only XML files with an article-level license that permits redistribution (machine-readable Creative Commons licenses). Items with no license, a custom license, or other unclear redistribution terms are not included.
- xml_licenses.tsv
Manifest listing each redistributed XML file (PMCID) together with its license (as captured from the XML metadata).
- README_licence.txt
Short note explaining how licenses apply to the archived XML files.
Non-redistributable items were placed in the folder 3.no_cc_code/ and are not redistributed.
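A minimal sketch of how the manifest could be summarised is given below. The exact layout of xml_licenses.tsv is not described in this record, so the example assumes a tab-separated file with a header row, the PMCID in the first column and the license in the second; the function name `count_licenses` and the demo rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_licenses(tsv_file, license_column=1, has_header=True):
    """Count how often each license value occurs in a tab-separated manifest."""
    reader = csv.reader(tsv_file, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the assumed header row
    return Counter(
        row[license_column] for row in reader if len(row) > license_column
    )


# Made-up rows standing in for the real manifest.
demo = io.StringIO(
    "pmcid\tlicense\n"
    "PMC0000001\tCC BY\n"
    "PMC0000002\tCC BY-NC\n"
    "PMC0000003\tCC BY\n"
)
print(count_licenses(demo).most_common())
```

For the real file, the same function could be called on `open("xml_licenses.tsv", newline="", encoding="utf-8")`, after adjusting `license_column` and `has_header` to the actual layout.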
Notes on licensing and redistribution (important)
The JATS XML files in 3.xml.zip are third-party content redistributed from the PubMed Central Open Access Subset. Each XML file is included only when its article-level license permits redistribution.
Reuse of each XML file is governed by its original license terms (as stated in the XML and, when applicable, on the article landing page). Licenses vary across files and include multiple Creative Commons variants (for example CC BY, CC BY-NC, CC BY-NC-ND, CC0). Non-commercial and no-derivatives restrictions apply where indicated.
The license selected for this Zenodo record does not modify or override the license terms attached to any included XML file. For transparency, xml_licenses.tsv provides the per-file license list for the entire archive.
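To show how the non-commercial and no-derivatives conditions might be checked per file before reuse, here is a small sketch. It assumes license values appear either as short labels (for example "CC BY-NC-ND") or as Creative Commons URLs; the function name `license_restrictions` is hypothetical, and the result is only a convenience flag, not a substitute for reading the license itself.

```python
import re


def license_restrictions(license_value: str) -> dict:
    """Flag NC and ND conditions from a Creative Commons label or URL."""
    # Split on anything that is not a letter or digit, so that both
    # "CC BY-NC-ND" and ".../licenses/by-nc-nd/4.0/" yield the tokens nc, nd.
    tokens = set(re.split(r"[^a-z0-9]+", license_value.lower()))
    return {
        "non_commercial": "nc" in tokens,
        "no_derivatives": "nd" in tokens,
    }


for value in (
    "CC BY",
    "CC BY-NC",
    "https://creativecommons.org/licenses/by-nc-nd/4.0/",
    "CC0",
):
    print(value, license_restrictions(value))
```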
Citation
If you use these XML files in academic work, please cite the associated manuscript and the Zenodo records below (code, XML corpus, and derived tables).
The full text JATS XML files in this record were obtained from the PubMed Central Open Access Subset and retrieved via the PMC OAI-PMH service. License terms vary by article; users are solely responsible for compliance with copyright restrictions and the terms defined by the copyright holder.
Please cite the source as: PMC Open Access Subset. Bethesda (MD): National Library of Medicine. 2003 - [cited YEAR MONTH DAY]. Available from https://pmc.ncbi.nlm.nih.gov/tools/openftlist/.
Manuscript:
Ballester, B. (2026). *From ‘data available upon request’ to accountable data access in genomics*. DOI: to be added.
Code (Zenodo):
Ballester, B. (2026). *Code: Text mining and analysis of “data available upon request” statements in scientific articles* (v1). Zenodo. https://doi.org/10.5281/zenodo.18339878
Full-text XML corpus (this record):
Ballester, B. (2026). *XML: PubMed Central Open Access Subset JATS XML files used for “data available upon request” analyses* (2010–2025). Zenodo. DOI: https://doi.org/10.5281/zenodo.18377386
Derived tables (2.data record):
Ballester, B. (2026). *Data: Text mining and analysis of “data available upon request” statements in scientific articles*. Zenodo. DOI: https://doi.org/10.5281/zenodo.18375259
Contact
Benoît Ballester (Aix Marseille Univ, INSERM, TAGC, UMR 1090, Marseille, France)
Files
Total size: 3.8 GB

| MD5 checksum | Size |
|---|---|
| md5:7fa5cc44a851d3b61e040061f321b1b3 | 1.0 GB |
| md5:a3549518390f2a21b0223f21f1ee01e6 | 1.0 GB |
| md5:e642e7409c7db311cad2bc6bc3a338aa | 1.0 GB |
| md5:1ebd1ea28a31c7619e08160337874948 | 674.8 MB |
| md5:97a4cd118e2b06361fd255dd7b4559f6 | 3.0 kB |
| md5:267f59c8926eeb62da3cc764beb6f87a | 7.5 MB |
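The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below uses only the Python standard library; the helper names `md5_of_file` and `verify` are hypothetical, and the local file name in the usage note is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for the multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Compare a file's MD5 with an expected value such as 'md5:7fa5...'."""
    # Accept the value with or without the 'md5:' prefix shown in the listing.
    expected = expected_md5.strip().lower().split(":")[-1]
    return md5_of_file(path) == expected
```

For example, `verify("downloaded_file.zip", "md5:7fa5cc44a851d3b61e040061f321b1b3")` returns `True` only when the local copy matches the first checksum in the listing; `downloaded_file.zip` stands for whatever name the file was saved under.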