Published September 3, 2025 | Version v1
Dataset Open

PathOS Impact of Artefact Reuse in COVID-19 Publications Case Study Research Data, Code, and Analysis Results

  • 1. Institute for Language and Speech Processing
  • 2. Athena Research and Innovation Center In Information Communication & Knowledge Technologies
  • 3. OpenAIRE Non-Profit Civil Partnership

Description

 

This repository contains the complete dataset, analysis scripts, and results for the Impact of Artefact Reuse in COVID-19 Publications case study.

Overview

This study investigates whether observable open science behaviors, specifically creating research artifacts that are subsequently reused by others, are associated with measurable downstream impact in COVID-19 research.

The analysis employs a regression-based approach using a filtered sample of 115,467 COVID-19 papers that created at least one dataset or software artifact and were cited at least once, ensuring all publications had potential for visibility and reuse.
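
The sample filter described above can be sketched as follows. This is a minimal illustration rather than the repository's own code: it assumes the column names total_artifacts and citationcount listed under Data Description, and the toy rows are invented purely to exercise the function. The provided dataset is described as already filtered, so the filter only matters when rebuilding the collection from raw sources.

```python
import pandas as pd


def filter_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Keep papers with at least one created artifact and at least one citation."""
    mask = (df["total_artifacts"] >= 1) & (df["citationcount"] >= 1)
    return df.loc[mask].reset_index(drop=True)


# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3", "p4"],
        "total_artifacts": [2, 0, 1, 3],
        "citationcount": [5, 10, 0, 1],
    }
)
sample = filter_sample(toy)
print(sample["id"].tolist())  # ['p1', 'p4']
```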

Reusability is operationalized through reuse-artifact citances: citations where other papers explicitly reference and reuse datasets or software created by the original publication. This provides empirical evidence that artifacts were not only shared but also found useful and actionable in practice.
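
One plausible reading of the two treatment columns is that the binary flag is derived from the inbound count. The sketch below encodes that reading; the rule "flag is 1 when reuse_artifact_inbound is greater than zero" is an assumption that this record does not confirm, and the toy rows are invented.

```python
import pandas as pd


def add_treatment_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a binary reuse flag derived from the inbound count."""
    out = df.copy()
    # ASSUMPTION: a paper counts as "reused" when it has at least one inbound
    # reuse-artifact citance. The dataset may define the flag differently.
    out["has_reuse_artifact_citance"] = (out["reuse_artifact_inbound"] > 0).astype(int)
    return out


# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({"id": ["p1", "p2", "p3"], "reuse_artifact_inbound": [0, 3, 1]})
flagged = add_treatment_flag(toy)
print(flagged["has_reuse_artifact_citance"].tolist())  # [0, 1, 1]
```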

Repository Structure

covid/
├── README.md
├── complete_collection_df_fix.parquet
├── complete_collection_df_fix.xlsx
├── covid_create_collection.py
├── covid_calculate_indicators.py
├── covid_find_paper_affiliations.py
├── covid_find_paper_openaireids.py
├── covid_indicators_create_data_for_vis.py
└── results/
    ├── *.xlsx
    ├── *.txt
    ├── *.parquet
    ├── tables/
    │   ├── 01_executive_summary.xlsx
    │   ├── 02_impact_by_artifact_type.xlsx
    │   └── ...
    ├── visualizations/
    │   ├── 01_sample_overview.png
    │   └── ...
    └── final_visualization_data_figures/
        ├── figures/
        └── data/

Data Description

Main Dataset (complete_collection_df_fix.parquet / .xlsx)

The dataset includes COVID-19 research papers that created at least one research artifact and were cited at least once.

Key variable groups:

  • Paper identifiers: id, year, citationcount, authorcount

  • Artifact creation: named_datasets_created, unnamed_datasets_created, named_software_created, unnamed_software_created, total_artifacts

  • Treatment variable: has_reuse_artifact_citance, reuse_artifact_inbound

  • Outcome variables: clinical trial/guideline citations (influential & non-influential), patent_citations, science_industry_collaboration

  • Control variables: fwci, interdisciplinarity_macro, interdisciplinarity_meso, science_industry_collaboration

  • Open access variables: isopenaccess_oaire, green, bronze, hybrid, gold, diamond
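
Before running any analysis, it may help to confirm that the loaded file contains the variable groups listed above. The sketch below assumes the column names match the description exactly; the clinical trial and guideline citation columns are left out because their exact names are not given here. The helper name missing_columns is hypothetical.

```python
import pandas as pd

# Column names copied from the variable groups above.
# ASSUMPTION: the file uses exactly these spellings.
EXPECTED_COLUMNS = {
    "identifiers": ["id", "year", "citationcount", "authorcount"],
    "artifact_creation": [
        "named_datasets_created",
        "unnamed_datasets_created",
        "named_software_created",
        "unnamed_software_created",
        "total_artifacts",
    ],
    "treatment": ["has_reuse_artifact_citance", "reuse_artifact_inbound"],
    "outcomes": ["patent_citations", "science_industry_collaboration"],
    "controls": [
        "fwci",
        "interdisciplinarity_macro",
        "interdisciplinarity_meso",
        "science_industry_collaboration",
    ],
    "open_access": ["isopenaccess_oaire", "green", "bronze", "hybrid", "gold", "diamond"],
}


def missing_columns(df: pd.DataFrame) -> dict:
    """Map each variable group to the expected columns that df lacks."""
    report = {}
    for group, cols in EXPECTED_COLUMNS.items():
        absent = [c for c in cols if c not in df.columns]
        if absent:
            report[group] = absent
    return report


# Toy frame with only three identifier columns, for illustration.
toy = pd.DataFrame(columns=["id", "year", "citationcount"])
print(missing_columns(toy)["identifiers"])  # ['authorcount']
```

With the real file, the frame would come from pd.read_parquet("complete_collection_df_fix.parquet"), which requires a parquet engine such as pyarrow to be installed.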

Scripts and Methodology

Core Analysis Scripts

  1. covid_create_collection.py – Data integration, indicator calculation, dataset creation

  2. covid_calculate_indicators.py – Regression analysis, interaction effects, statistical outputs

  3. covid_find_paper_affiliations.py – Affiliation and collaboration analysis

  4. covid_find_paper_openaireids.py – OpenAIRE ID linkage

  5. covid_indicators_create_data_for_vis.py – Visualization data and publication figures
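
The exact model specification used in covid_calculate_indicators.py is not described in this record. As a rough illustration of a regression of an outcome on the treatment flag plus a control, the sketch below fits ordinary least squares on a log1p-transformed outcome using plain NumPy. Both the transformation and the choice of OLS are assumptions, the function name fit_ols is hypothetical, and the toy rows are invented. statsmodels, which the replication steps list as a dependency, would additionally report standard errors and confidence intervals.

```python
import numpy as np
import pandas as pd


def fit_ols(df: pd.DataFrame, outcome: str, predictors: list) -> pd.Series:
    """Least-squares fit of log1p(outcome) on the predictors plus an intercept."""
    columns = [np.ones(len(df))] + [df[p].to_numpy(dtype=float) for p in predictors]
    X = np.column_stack(columns)
    # ASSUMPTION: count outcomes are log1p-transformed before fitting.
    y = np.log1p(df[outcome].to_numpy(dtype=float))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=["intercept"] + list(predictors))


# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame(
    {
        "patent_citations": [0, 2, 1, 5, 0, 3],
        "has_reuse_artifact_citance": [0, 1, 0, 1, 0, 1],
        "fwci": [0.8, 1.5, 1.1, 2.4, 0.6, 1.9],
    }
)
coefs = fit_ols(toy, "patent_citations", ["has_reuse_artifact_citance", "fwci"])
print(coefs.round(3))
```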

External Data Sources

To fully reproduce the collection, large-scale data sources are required (not included here due to size/licensing):

  • Semantic Scholar Academic Graph

  • OpenAIRE Graph

  • PubMed (clinical trial & guideline classification)

  • PATSTAT (patent citations)

  • ROR (Research Organization Registry)

  • CORD-19 dataset

  • SciNoBo Toolkit (for interdisciplinarity, FWCI, citance, and artifact analysis)

The final processed dataset is provided, with all indicators and outcomes pre-computed.

Key Findings

COVID-19 papers with evidence of artifact reuse are associated with greater downstream impact:

  • More citations from clinical trial studies

  • More citations from clinical practice guidelines

  • Higher patent citations (innovation impact)

  • Increased science-industry collaboration
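
A quick descriptive check of these associations is to compare mean outcomes between papers with and without reuse evidence. The sketch below does that with a groupby; it yields unadjusted group means, not the regression estimates reported in the results, and the toy rows are invented for illustration. The function name compare_by_treatment is hypothetical.

```python
import pandas as pd


def compare_by_treatment(df: pd.DataFrame, outcomes: list) -> pd.DataFrame:
    """Mean of each outcome for papers without (0) and with (1) reuse evidence."""
    return df.groupby("has_reuse_artifact_citance")[outcomes].mean()


# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame(
    {
        "has_reuse_artifact_citance": [0, 0, 1, 1],
        "patent_citations": [0, 2, 4, 6],
        "science_industry_collaboration": [0, 0, 1, 0],
    }
)
means = compare_by_treatment(toy, ["patent_citations", "science_industry_collaboration"])
print(means)
```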

Results and Outputs

  • Executive summary: results/tables/01_executive_summary.xlsx

  • Regression results: results/regression_output_*.txt

  • Interaction effects: results/tables/16-19_interaction_*.xlsx

  • Visualizations: results/visualizations/ and results/final_visualization_data_figures/
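
The wildcard paths above can be resolved with a small glob helper. The sketch below is illustrative: find_outputs is a hypothetical name, and the files it creates in a temporary directory exist only to exercise the helper, not to mirror the actual archive contents.

```python
import tempfile
from pathlib import Path


def find_outputs(results_dir, pattern):
    """Return files under results_dir matching a glob pattern, sorted by name."""
    return sorted(Path(results_dir).glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, created only to demonstrate the matching.
    for name in ["regression_output_a.txt", "regression_output_b.txt", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in find_outputs(tmp, "regression_output_*.txt")]

print(found)  # ['regression_output_a.txt', 'regression_output_b.txt']
```

Against the unpacked archive, the same call would take the results/ directory as its first argument.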

Usage Instructions

For statistical analysis:

  1. Load dataset (complete_collection_df_fix.parquet / .xlsx)

  2. Review summary statistics (01_executive_summary.xlsx)

  3. Inspect regression results (regression_results_summary_covid.xlsx)

  4. Explore interaction tables (16–19)

  5. Use provided visualizations

For replication:

  1. Configure all PATH_TO_* variables in scripts

  2. Install dependencies (pandas, statsmodels, matplotlib, seaborn)

  3. Run covid_calculate_indicators.py

  4. Generate visualizations via covid_indicators_create_data_for_vis.py
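
Before running the scripts, a short check can confirm that the dependencies from step 2 are importable. The sketch below uses only the standard library; the function name missing_dependencies is hypothetical. Reading the .parquet file through pandas also needs a parquet engine such as pyarrow, and the .xlsx files need an Excel reader such as openpyxl; neither is listed in the steps above, so treat that as a note rather than a documented requirement.

```python
import importlib.util

# Packages named in the replication steps above.
REQUIRED = ["pandas", "statsmodels", "matplotlib", "seaborn"]


def missing_dependencies(packages=REQUIRED):
    """Return the names of packages that cannot be found in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = missing_dependencies()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```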

For extension:

  • Apply methods to other domains

  • Add new outcome variables

  • Modify treatment definitions or timeframes

  • Adapt regression framework to bibliometric studies

Sample Sizes and Coverage

  • Total COVID-19 papers analyzed: 115,467

  • Time period: Publications through 2021 (later papers are excluded to avoid bias from citations that have not yet had time to accumulate)

  • Coverage: Global (Semantic Scholar + OpenAIRE)

Quality Assurance

  • Multiple data validation steps

  • Robustness tests with interaction analyses

  • Documented and reproducible workflows

  • Best-practice statistical methods (control variables, confidence intervals, effect sizes)

Files

pathos_covid_19_case_study_files.zip (148.9 MB)
md5:e1a14d5cc886765025c1fc595bca10cf

Additional details

Funding

European Commission
PathOS - Open Science Impact Pathways 101058728