PathOS Impact of Artefact Reuse in COVID-19 Publications Case Study Research Data, Code, and Analysis Results
Description
This repository contains the complete dataset, analysis scripts, and results for the Impact of Artefact Reuse in COVID-19 Publications case study.
Overview
This study investigates whether observable open science behaviors, specifically creating research artifacts that are subsequently reused by others, are associated with measurable downstream impact in COVID-19 research.
The analysis employs a regression-based approach using a filtered sample of 115,467 COVID-19 papers that created at least one dataset or software artifact and were cited at least once, ensuring all publications had potential for visibility and reuse.
Reusability is operationalized through reuse-artifact citances: citations where other papers explicitly reference and reuse datasets or software created by the original publication. This provides empirical evidence that artifacts were not only shared but also found useful and actionable in practice.
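The sample definition and the reuse indicator described above can be written as a short pandas filter. This is a minimal sketch, not the authors' code. It uses the column names `total_artifacts`, `citationcount`, and `reuse_artifact_inbound` from the Data Description section. It also assumes that the binary indicator means "at least one inbound reuse citance", which this README does not state explicitly. The toy rows are invented for illustration.

```python
import pandas as pd


def filter_sample(papers: pd.DataFrame) -> pd.DataFrame:
    """Keep papers with at least one created artifact and at least one citation."""
    mask = (papers["total_artifacts"] >= 1) & (papers["citationcount"] >= 1)
    return papers.loc[mask].copy()


def add_reuse_flag(papers: pd.DataFrame) -> pd.DataFrame:
    """Derive a 0/1 reuse flag from the inbound reuse-citance count.

    ASSUMPTION: flag = 1 when reuse_artifact_inbound > 0. The README does not
    spell out how has_reuse_artifact_citance was computed, so the result is
    stored under a separate, hypothetical column name.
    """
    out = papers.copy()
    out["reuse_flag_sketch"] = (out["reuse_artifact_inbound"] > 0).astype(int)
    return out


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3", "p4"],
        "total_artifacts": [2, 0, 1, 3],
        "citationcount": [10, 5, 0, 1],
        "reuse_artifact_inbound": [4, 0, 0, 0],
    }
)
sample = add_reuse_flag(filter_sample(toy))
print(sample[["id", "reuse_flag_sketch"]])
```

The provided dataset is already filtered, so applying `filter_sample` to it should leave the row count unchanged if the assumptions above hold.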
Repository Structure
covid/
├── README.md
├── complete_collection_df_fix.parquet
├── complete_collection_df_fix.xlsx
├── covid_create_collection.py
├── covid_calculate_indicators.py
├── covid_find_paper_affiliations.py
├── covid_find_paper_openaireids.py
├── covid_indicators_create_data_for_vis.py
└── results/
    ├── *.xlsx
    ├── *.txt
    ├── *.parquet
    ├── tables/
    │   ├── 01_executive_summary.xlsx
    │   ├── 02_impact_by_artifact_type.xlsx
    │   └── ...
    ├── visualizations/
    │   ├── 01_sample_overview.png
    │   └── ...
    └── final_visualization_data_figures/
        ├── figures/
        └── data/
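After unpacking the archive, a quick check that the named entries exist can catch an incomplete download. The sketch below uses only the standard library and assumes the nesting implied by the tree above (tables, visualizations, and final_visualization_data_figures inside results/). Wildcard and "..." entries are skipped because their concrete names are not listed here. The helper name `missing_entries` is hypothetical.

```python
from pathlib import Path

# Entries named in the tree above; wildcard and "..." entries are left out.
EXPECTED = [
    "README.md",
    "complete_collection_df_fix.parquet",
    "complete_collection_df_fix.xlsx",
    "covid_create_collection.py",
    "covid_calculate_indicators.py",
    "covid_find_paper_affiliations.py",
    "covid_find_paper_openaireids.py",
    "covid_indicators_create_data_for_vis.py",
    "results/tables/01_executive_summary.xlsx",
    "results/tables/02_impact_by_artifact_type.xlsx",
    "results/visualizations/01_sample_overview.png",
    "results/final_visualization_data_figures/figures",
    "results/final_visualization_data_figures/data",
]


def missing_entries(root: str) -> list[str]:
    """Return the expected relative paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    # "covid" is the top-level folder in the tree; adjust to your unpack location.
    for rel in missing_entries("covid"):
        print("missing:", rel)
```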
Data Description
Main Dataset (complete_collection_df_fix.parquet / .xlsx)
The dataset includes COVID-19 research papers that created at least one research artifact and were cited at least once.
Key variable groups:
- Paper identifiers: id, year, citationcount, authorcount
- Artifact creation: named_datasets_created, unnamed_datasets_created, named_software_created, unnamed_software_created, total_artifacts
- Treatment variable: has_reuse_artifact_citance, reuse_artifact_inbound
- Outcome variables: clinical trial/guideline citations (influential & non-influential), patent_citations, science_industry_collaboration
- Control variables: fwci, interdisciplinarity_macro, interdisciplinarity_meso, science_industry_collaboration
- Open access variables: isopenaccess_oaire, green, bronze, hybrid, gold, diamond
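The variable groups above can be turned into a small schema check that reports which documented columns are absent from a loaded table. This is a sketch under stated assumptions: the group keys and the helper `missing_columns` are hypothetical names. The clinical trial and guideline citation outcomes are left out because their exact column names are not given in this README.

```python
# Column names copied from the variable groups above. The clinical trial and
# guideline citation outcomes are omitted: their column names are not listed.
VARIABLE_GROUPS = {
    "identifiers": ["id", "year", "citationcount", "authorcount"],
    "artifact_creation": [
        "named_datasets_created",
        "unnamed_datasets_created",
        "named_software_created",
        "unnamed_software_created",
        "total_artifacts",
    ],
    "treatment": ["has_reuse_artifact_citance", "reuse_artifact_inbound"],
    "outcomes": ["patent_citations", "science_industry_collaboration"],
    "controls": [
        "fwci",
        "interdisciplinarity_macro",
        "interdisciplinarity_meso",
        "science_industry_collaboration",
    ],
    "open_access": [
        "isopenaccess_oaire", "green", "bronze", "hybrid", "gold", "diamond",
    ],
}


def missing_columns(columns) -> dict[str, list[str]]:
    """Map each variable group to the documented columns not found in `columns`."""
    present = set(columns)
    report = {}
    for group, names in VARIABLE_GROUPS.items():
        absent = [name for name in names if name not in present]
        if absent:
            report[group] = absent
    return report


# Illustration with a made-up column list that lacks "diamond".
all_names = [name for names in VARIABLE_GROUPS.values() for name in names]
example = [name for name in all_names if name != "diamond"]
print(missing_columns(example))
```

With pandas installed, the same function accepts `df.columns` from the loaded dataset.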
Scripts and Methodology
Core Analysis Scripts
- covid_create_collection.py: Data integration, indicator calculation, dataset creation
- covid_calculate_indicators.py: Regression analysis, interaction effects, statistical outputs
- covid_find_paper_affiliations.py: Affiliation and collaboration analysis
- covid_find_paper_openaireids.py: OpenAIRE ID linkage
- covid_indicators_create_data_for_vis.py: Visualization data and publication figures
External Data Sources
To fully reproduce the collection, large-scale data sources are required (not included here due to size/licensing):
- Semantic Scholar Academic Graph
- OpenAIRE Graph
- PubMed (clinical trial & guideline classification)
- PATSTAT (patent citations)
- ROR (Research Organization Registry)
- CORD-19 dataset
- SciNoBo Toolkit (for interdisciplinarity, FWCI, citance, and artifact analysis)
The final processed dataset is provided, with all indicators and outcomes pre-computed.
Key Findings
COVID-19 papers with artifact reuse evidence show greater downstream impact:
- More citations from clinical trial studies
- More citations from clinical practice guidelines
- Higher patent citations (innovation impact)
- Increased science-industry collaboration
Results and Outputs
- Executive summary: results/tables/01_executive_summary.xlsx
- Regression results: results/regression_output_*.txt
- Interaction effects: results/tables/16-19_interaction_*.xlsx
- Visualizations: results/visualizations/ and results/final_visualization_data_figures/
Usage Instructions
For statistical analysis:
- Load dataset (complete_collection_df_fix.parquet / .xlsx)
- Review summary statistics (01_executive_summary.xlsx)
- Inspect regression results (regression_results_summary_covid.xlsx)
- Explore interaction tables (16–19)
- Use provided visualizations
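The first two steps above might look like the following in pandas. This is a hedged sketch: `load_collection` and `summarize_by_reuse` are hypothetical helpers, and the group means are a purely descriptive comparison, not the regression used in the study. Reading Parquet through pandas requires pyarrow or fastparquet, and reading .xlsx requires openpyxl. Neither is listed among the dependencies in this README. The toy frame is invented.

```python
from pathlib import Path

import pandas as pd


def load_collection(path: str) -> pd.DataFrame:
    """Read the main dataset from either the Parquet or the Excel copy."""
    suffix = Path(path).suffix.lower()
    if suffix == ".parquet":
        return pd.read_parquet(path)  # needs pyarrow or fastparquet
    if suffix == ".xlsx":
        return pd.read_excel(path)  # needs openpyxl
    raise ValueError(f"unsupported file type: {suffix}")


def summarize_by_reuse(df: pd.DataFrame, outcomes: list[str]) -> pd.DataFrame:
    """Group size and mean of each outcome, split by the reuse indicator."""
    grouped = df.groupby("has_reuse_artifact_citance")
    summary = grouped[outcomes].mean()
    summary.insert(0, "n_papers", grouped.size())
    return summary


# Toy data for illustration only; replace with
# load_collection("complete_collection_df_fix.parquet") on the real file.
toy = pd.DataFrame(
    {
        "has_reuse_artifact_citance": [1, 1, 0, 0],
        "patent_citations": [2, 4, 0, 1],
    }
)
summary = summarize_by_reuse(toy, ["patent_citations"])
print(summary)
```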
For replication:
- Configure all PATH_TO_* variables in scripts
- Install dependencies (pandas, statsmodels, matplotlib, seaborn)
- Run covid_calculate_indicators.py
- Generate visualizations via covid_indicators_create_data_for_vis.py
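Before running the scripts, it can help to confirm that the listed dependencies are importable. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the `OPTIONAL_IO` list is an assumption: pandas commonly relies on pyarrow and openpyxl for Parquet and Excel files, but this README does not name them.

```python
from importlib.util import find_spec

# Dependencies named in the replication steps above.
REQUIRED = ["pandas", "statsmodels", "matplotlib", "seaborn"]

# ASSUMPTION: typical pandas I/O backends for .parquet and .xlsx files;
# not listed in this README.
OPTIONAL_IO = ["pyarrow", "openpyxl"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL_IO))
```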
For extension:
- Apply methods to other domains
- Add new outcome variables
- Modify treatment definitions or timeframes
- Adapt regression framework to bibliometric studies
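One way to make the outcome, treatment, and interaction terms easy to swap is to generate the model formula from lists of column names. The sketch below is an illustration, not the specification used in covid_calculate_indicators.py. The README does not state the model family, so ordinary least squares via the statsmodels formula API is a placeholder assumption, and the choice of `isopenaccess_oaire` as a moderator is hypothetical. `build_formula` and `fit_model` are invented helper names.

```python
def build_formula(outcome, treatment="has_reuse_artifact_citance",
                  controls=(), interactions=()):
    """Build a formula string: outcome ~ treatment + controls + interactions.

    Each moderator in `interactions` enters as a main effect (if not already
    present) plus a treatment-by-moderator interaction term.
    """
    terms = [treatment, *controls]
    for moderator in interactions:
        if moderator not in terms:
            terms.append(moderator)
        terms.append(f"{treatment}:{moderator}")
    return f"{outcome} ~ " + " + ".join(terms)


def fit_model(df, formula):
    """Fit the formula on a DataFrame. OLS is a placeholder model choice."""
    # Imported lazily so build_formula stays usable without statsmodels.
    import statsmodels.formula.api as smf

    return smf.ols(formula, data=df).fit()


formula = build_formula(
    "patent_citations",
    controls=["fwci", "interdisciplinarity_macro", "interdisciplinarity_meso"],
    interactions=["isopenaccess_oaire"],  # hypothetical moderator
)
print(formula)
```

Adding a new outcome then amounts to calling `build_formula` with a different first argument. Count outcomes such as citations may call for a count model rather than OLS.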
Sample Sizes and Coverage
- Total COVID-19 papers analyzed: 115,467
- Time period: Publications through 2021 (avoiding recent citation bias)
- Coverage: Global (Semantic Scholar + OpenAIRE)
Quality Assurance
- Multiple data validation steps
- Robustness tests with interaction analyses
- Documented and reproducible workflows
- Best-practice statistical methods (controls, CIs, effect sizes)
Files
- pathos_covid_19_case_study_files.zip (148.9 MB), md5: e1a14d5cc886765025c1fc595bca10cf
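The MD5 value listed for the archive can be used to confirm that a download is complete. The following is a minimal standard-library sketch. The helper names are hypothetical, and the local file path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum listed for pathos_covid_19_case_study_files.zip above.
EXPECTED_MD5 = "e1a14d5cc886765025c1fc595bca10cf"


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True when the file's MD5 digest matches the expected value."""
    return file_md5(path) == expected.lower()
```

For example, `archive_is_intact("pathos_covid_19_case_study_files.zip")` should return True for an uncorrupted copy of the archive.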