Published September 3, 2025 | Version v1
Dataset Open

PathOS Impact of Open Access Routes on Topic Persistence Case Study Research Data, Code, and Analysis Results

  • 1. Institute for Language and Speech Processing
  • 2. Athena Research and Innovation Center In Information Communication & Knowledge Technologies
  • 3. OpenAIRE Non-Profit Civil Partnership

Description

 

This repository contains the data, scripts, and results for the Impact of Open Access Routes on Topic Persistence case study, part of the PathOS project.

Overview

Artificial intelligence methods are being rapidly mobilized to tackle the climate crisis, but the knowledge base often burns bright and fades quickly. This case study asks whether two distinct Open Access (OA) routes help AI-for-Climate research topics stay active in the literature:

  • Green OA: self-archiving in repositories

  • Published OA: journal-mediated open access with a clear licence

Bronze OA and dual-mode publications are excluded so that each treatment is defined unambiguously. Closed Access (CA) articles serve as the comparison group that approximates the counterfactual.

By foregrounding topic persistence as a key dimension of impact, the study goes beyond short-term citation counts and investigates whether openness helps research topics remain visible long enough to demonstrate their potential.

Repository Structure

├── README.md
├── fos_taxonomy_v0.1.2.json
├── persistent_topics_create_collection.py
├── persistent_topics_find_paper_openaireids.py
├── persistent_topics_find_paper_affiliations.py
├── persistent_topics_get_collection_author_gender.py
├── persistent_topics_calculate_indicators.py
├── persistent_topics_calculate_indicators_sdg.py
├── persistent_topics_indicators_create_data_for_vis.py
└── persistent_topics_collection_w_outcomes/
    ├── complete_collection_df.parquet / .xlsx
    ├── topic_attribution_df.parquet / .xlsx
    ├── results/
    │   ├── analysis_conclusions.txt
    │   ├── summary_statistics.xlsx
    │   ├── treatment_effects_green_oa.xlsx
    │   ├── treatment_effects_published_oa.xlsx
    │   ├── descriptive_effects_any_oa.xlsx
    │   ├── tables/
    │   │   ├── 01_executive_summary.xlsx
    │   │   ├── 02_treatment_group_characteristics.xlsx
    │   │   ├── 03_causal_effects_summary.xlsx
    │   │   ├── 04_topic_persistence_analysis.xlsx
    │   │   ├── 05_gender_equity_outcomes.xlsx
    │   │   ├── 06_economic_impact_analysis.xlsx
    │   │   ├── 07_publication_year_analysis.xlsx
    │   │   └── 08_robustness_analysis.xlsx
    │   ├── visualizations/
    │   │   ├── 01_sample_overview.png
    │   │   ├── 02_causal_effects.png
    │   │   ├── 03_outcome_analysis.png
    │   │   └── 04_temporal_and_balance.png
    │   └── final_visualization_data_figures/
    │       ├── data/
    │       └── figures/
    └── results_sdg_only/
        ├── sdg_analysis_conclusions.txt
        ├── green_matched_sdg_papers.xlsx
        ├── published_matched_sdg_papers.xlsx
        ├── closed_matched_a_sdg_papers.xlsx
        ├── closed_matched_b_sdg_papers.xlsx
        ├── tables/
        │   ├── 01_sdg_distribution_matched_samples.xlsx
        │   ├── 02_sdg_treatment_effects.xlsx
        │   ├── 03_sdg_vs_non_sdg_comparison.xlsx
        │   ├── 04_sdg_categories_by_impact.xlsx
        │   ├── 05_sdg_gender_industry_collaboration.xlsx
        │   ├── 06_sdg_analysis_summary.xlsx
        │   ├── 07_sdg_alignment_comparison_matched.xlsx
        │   └── 08_sdg_alignment_effects_summary.xlsx
        └── visualizations/
            ├── 01_sdg_distribution_overview.png
            ├── 02_sdg_treatment_effects.png
            ├── 03_sdg_impact_analysis.png
            └── 04_sdg_alignment_comparison_matched.png
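
After unpacking the archive, it can help to confirm that the key files from the tree above are present. The following is a minimal sketch using only the standard library. The EXPECTED list is a hand-picked subset of the tree, and the function name missing_files and the extraction root are assumptions for illustration, not part of the repository.

```python
from pathlib import Path

# Relative paths copied from the repository tree (a subset, not exhaustive).
EXPECTED = [
    "README.md",
    "fos_taxonomy_v0.1.2.json",
    "persistent_topics_collection_w_outcomes/complete_collection_df.parquet",
    "persistent_topics_collection_w_outcomes/topic_attribution_df.parquet",
    "persistent_topics_collection_w_outcomes/results/analysis_conclusions.txt",
    "persistent_topics_collection_w_outcomes/results_sdg_only/sdg_analysis_conclusions.txt",
]


def missing_files(root, expected=EXPECTED):
    """Return the expected relative paths that are not files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling missing_files("path/to/extracted/archive") returns an empty list when every listed file is found. If the archive unpacks into an extra top-level folder, pass that folder as the root.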

Data Sources

External Data Sources (not included)

  • Semantic Scholar Academic Graph: full publication metadata

  • OpenAIRE Graph: European research infrastructure data

  • PATSTAT: patent database for citation analysis

  • ROR: Research Organization Registry

  • SciNoBo toolkit: FOS classification, interdisciplinarity, SDG mapping, FWCI scores

Included Data

  • Complete processed collection with outcomes

  • Topic attribution dataset (paper-topic mappings, persistence scores)

  • Analysis results: matched samples, treatment effects, summary statistics

  • SciNoBo Field of Science taxonomy (fos_taxonomy_v0.1.2.json)
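
The internal structure of fos_taxonomy_v0.1.2.json is not described in this record, so a cautious first step is to inspect it without assuming any schema. The sketch below only reports the top-level JSON type and size. The function name summarize_json is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Report the top-level type and size of a JSON file without assuming a schema."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        # JSON object keys are always strings, so sorting them is safe.
        return {"type": "object", "size": len(data), "sample_keys": sorted(data)[:5]}
    if isinstance(data, list):
        return {"type": "array", "size": len(data), "sample_keys": []}
    return {"type": type(data).__name__, "size": 1, "sample_keys": []}
```

For example, summarize_json("fos_taxonomy_v0.1.2.json") shows whether the taxonomy is stored as an object or an array before any field-level parsing is attempted.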

Scripts

Data Processing

  • persistent_topics_create_collection.py – integrates multiple data sources, outcomes, affiliations, patent citations

  • persistent_topics_find_paper_openaireids.py – maps DOIs to OpenAIRE IDs

  • persistent_topics_find_paper_affiliations.py – extracts affiliations, science-industry collaboration

  • persistent_topics_get_collection_author_gender.py – gender classification of authors

Analysis

  • persistent_topics_calculate_indicators.py – main causal inference analysis (PSM for Green OA vs CA, Published OA vs CA)

  • persistent_topics_calculate_indicators_sdg.py – SDG-focused treatment effects

  • persistent_topics_indicators_create_data_for_vis.py – prepares final visualization datasets and figures

Key Findings

Sample

  • Total: 132,134 papers (2000–2021)

  • Green OA: 3,792 papers

  • Published OA: 19,045 papers

  • Closed Access: 92,998 papers
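
The three group sizes above do not add up to the total: they sum to 115,835, leaving 16,299 papers outside the analysis groups. The record does not state where these papers fall. A plausible reading, which is an assumption here, is that they are the excluded Bronze OA and dual-mode publications. The arithmetic can be checked with a short sketch (the function and key names are illustrative):

```python
TOTAL = 132_134
GROUPS = {"green_oa": 3_792, "published_oa": 19_045, "closed_access": 92_998}


def group_summary(total, groups):
    """Sum the group sizes and express each as a percentage of the total."""
    assigned = sum(groups.values())
    shares = {name: round(100 * n / total, 1) for name, n in groups.items()}
    return {"assigned": assigned, "unassigned": total - assigned, "shares_pct": shares}


summary = group_summary(TOTAL, GROUPS)
```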

Contributions

  1. New Topic Persistence Metric for long-term impact

  2. Clean OA treatment definitions (excluding dual-mode and Bronze)

  3. Separate analysis of Green vs Published OA pathways

Main Results

  • 8 significant causal effects across outcomes

  • Enhanced topic persistence in OA papers

  • Positive gender equity outcomes

  • Evidence of economic impact (patents, collaborations)

SDG Findings

  • 24,948 SDG-relevant papers (18.9% of sample)

  • 11 significant treatment effects for SDG-related research

  • Stronger knowledge sustainability for research aimed at achieving the SDGs

Methodology

Design

  • Propensity Score Matching (PSM) with balanced covariates

  • Separate analyses for Green OA vs CA and Published OA vs CA

  • Robust outcome metrics (including new persistence measure)
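
To illustrate the matching step of PSM in general terms, here is a minimal sketch, not the implementation in persistent_topics_calculate_indicators.py. It assumes propensity scores have already been estimated, uses greedy 1:1 nearest-neighbour matching without replacement, and applies an illustrative caliper of 0.05. All function names and the caliper value are assumptions. The last function is the standard standardized mean difference, a common way to check covariate balance after matching.

```python
from statistics import mean, variance


def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated / controls map a paper id to a precomputed propensity score.
    The caliper (maximum allowed score gap) is an illustrative value.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    # Visit treated units from highest to lowest propensity score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


def att(pairs, outcome):
    """Mean treated-minus-control outcome difference over matched pairs."""
    if not pairs:
        raise ValueError("no matched pairs")
    return mean(outcome[t] - outcome[c] for t, c in pairs)


def standardized_mean_difference(treated_values, control_values):
    """Balance diagnostic: difference in means over the pooled standard deviation."""
    pooled_sd = ((variance(treated_values) + variance(control_values)) / 2) ** 0.5
    return (mean(treated_values) - mean(control_values)) / pooled_sd
```

In this sketch, treated units with no control inside the caliper are dropped, so the estimated effect applies only to the matched subsample.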

Treatment Definitions

  • Green OA: repository-based

  • Published OA: journal-based (gold, hybrid, diamond)

  • Closed Access: no open provision

  • Excluded: dual-mode and Bronze OA
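
The treatment definitions above can be read as a simple assignment rule. The sketch below is one possible encoding and rests on two assumptions: each paper comes with a set of OA route labels (the label strings are hypothetical), and "dual-mode" means a paper available through both a repository and an open journal route.

```python
# Journal-based routes grouped under Published OA in the definitions above.
PUBLISHED_ROUTES = frozenset({"gold", "hybrid", "diamond"})


def assign_treatment(routes):
    """Map a paper's set of OA route labels to an analysis group."""
    routes = set(routes)
    if "bronze" in routes:
        return "excluded"
    has_green = "green" in routes
    has_published = bool(routes & PUBLISHED_ROUTES)
    if has_green and has_published:
        return "excluded"  # dual-mode, under the assumption stated above
    if has_green:
        return "green_oa"
    if has_published:
        return "published_oa"
    return "closed_access"
```

For example, assign_treatment({"green", "hybrid"}) yields "excluded", while an empty set of routes yields "closed_access".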

Outcomes

  1. Citation impact (traditional)

  2. Topic persistence (novel metric)

  3. Gender equity in authorship

  4. Economic impact (patents, collaboration)

  5. Field effects (disciplinary and SDG)

Files

pathos_persistent_topics_case_study_files.zip (48.8 MB)
md5:c2b35dab093b53ccdf1ebfa2fa3b89a5
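
The MD5 checksum listed for the archive can be used to confirm that a download is intact. A minimal sketch with the standard hashlib module follows. The function names are illustrative, and the file is read in chunks so the whole archive does not need to fit in memory.

```python
import hashlib

# Checksum as listed for pathos_persistent_topics_case_study_files.zip.
EXPECTED_MD5 = "c2b35dab093b53ccdf1ebfa2fa3b89a5"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling verify_download("pathos_persistent_topics_case_study_files.zip") returns True only if the local file matches the published checksum.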

Additional details

Funding

European Commission
PathOS - Open Science Impact Pathways 101058728