---------------------------------
ReRx_drug_repurposing_multicohort
---------------------------------
This dataset provides the scripts, data, and other relevant information for the analyses in the study entitled "Drug Repurposing for Parkinson’s Disease: A Large-Scale Multi-Cohort Study", which aims to identify drugs associated with the risk and progression of Parkinson’s disease (PD) and thereby provide insights into potential therapeutics and targets. We used data from the Mass General Brigham (MGB) Biobank for discovery and the Accelerating Medicines Partnership Parkinson’s Disease (AMP-PD) program for replication. Specifically, the two largest AMP-PD sub-cohorts (PPMI and PDBP) were combined to form the replication cohort.

This readme file was generated on 2025-04-14 by Yuxuan Hu with the help of Weiqiang Liu and Xianjun Dong (PI).

Xianjun Dong, PhD
Associate Professor, Departments of Neurology and Biomedical Informatics and Data Science
Yale School of Medicine’s Stephen & Denise Adams Center for Parkinson’s Disease Research
101 College Street
New Haven, CT 06510
Email: xianjun.dong@yale.edu
ORCID: 0000-0002-8052-9320

Date of Raw Data Collection: PPMI (2023-10-01); PDBP (2023-12-04); MGB Biobank (2024-06-21); atccodes (2024-12-01); export_Touchstone (2025-02-07)

Software Dependencies: R (4.4.1) with the R packages listed below.

library(broom)
library(broom.mixed)
library(cobalt)
library(DESeq2)
library(dplyr)
library(edgeR)
library(ggplot2)
library(ggrepel)
library(lme4)
library(lmerTest)
library(lubridate)
library(MatchIt)
library(Matrix)
library(meta)
library(openxlsx)
library(patchwork)
library(RColorBrewer)
library(readxl)
library(reshape2)
library(Seurat)
library(stringr)
library(tidyr)
library(tidyverse)
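Before running the scripts, it can help to check which of these packages are not yet installed. The snippet below is a minimal sketch and is not part of the deposited scripts; the variable names are illustrative, and it assumes that DESeq2 and edgeR are installed from Bioconductor and the remaining packages from CRAN.

```r
# Packages loaded by the analysis scripts (copied from the list above).
pkgs <- c("broom", "broom.mixed", "cobalt", "DESeq2", "dplyr", "edgeR",
          "ggplot2", "ggrepel", "lme4", "lmerTest", "lubridate", "MatchIt",
          "Matrix", "meta", "openxlsx", "patchwork", "RColorBrewer",
          "readxl", "reshape2", "Seurat", "stringr", "tidyr", "tidyverse")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package is not installed.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # CRAN packages:         install.packages("<name>")
  # Bioconductor packages: BiocManager::install("<name>")
} else {
  message("All listed packages are installed.")
}
```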

----------------
Analysis process 
----------------
Directory of Files: The default path of the project folder is "~/project/ReRx/". This dataset contains three main folders; the relevant files and their formats are shown in the directory below. Note: The raw data are available upon reasonable request, following approval by the MGB Biobank and AMP-PD. No special file naming convention is used.

--------------------------------------------------------------------
- data
  - raw_data
    - atccodes.csv
    - export_Touchstone.txt
    - MGB_Biobank [all raw files that were downloaded from databases]
    - PDBP [all raw files that were downloaded from databases]
    - PPMI [all raw files that were downloaded from databases]
  - processed_data
    - drug_dictionary_MGB.csv
    - drug_dictionary_update_PDBP.xlsx
    - drug_dictionary_update_PPMI.xlsx
- results
  - PD_risk_results_AMPPD.xlsx
  - PD_risk_results_MGB.xlsx
  - AMPPD_with_ATC_update.xlsx
  - MGB_with_ATC_update.xlsx
- scripts
  - Characteristic.R
  - Drug_dictionary.R
  - Medication.R
  - PD_progression.R
  - PD_risk.R
  - Replication.R
  - Sensitivity_analysis.R
  - Target_genes.R

- ReRx.txt
--------------------------------------------------------------------

1. Data cleaning of medication records
We clean all medication records and unify them into generic drug names for downstream analysis. The input data include the medication files of PPMI (2023-10-01), PDBP (2023-12-04), and MGB Biobank (2024-06-21). First, the Medication.R script is used to generate files of unique medications; then Drug_dictionary.R is applied to map these medications to generic drug names. Finally, Medication.R is run again to regenerate the final medication files containing all records for downstream analysis. The exact names of the medication files can be found in the Medication.R script.

Data specific information for medication files:
(i) /raw_data/MGB_Biobank: EMPI[Identifier of participants]|Medication_Date[Date of medications]|Medication[Medication records]|Hospital[Hospital resources]|
(ii) /raw_data/PPMI: PATNO[Identifier of participants]|CMTRT[Medication records]|STARTDT[Date of medications]|
(iii) /raw_data/PDBP: Study_ID[Study ID of different resources]|PATNO[Identifier of participants]|CMTRT[Medication records]|EVENT_ID[Date of medications]|
(iiii) /processed_data/drug_dictionary_*: CMTRT[Medication records]|DrugName[Generic drug name of medications]

a. Medication.R
Prepares the medication datasets (MGB Biobank and AMP-PD) for the logistic regression and linear mixed-effects models. The inputs to this script are the three raw medication files (i)-(iii) and the processed dictionary file (iiii).
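
The first pass of step 1 (collecting the unique medication strings that need a generic name) could be sketched as follows. This is an illustration under assumptions, not the code in Medication.R: the rows are invented, the normalization (trim and lowercase) and all variable names are hypothetical, and only the column names EMPI, Medication, PATNO, and CMTRT come from the file descriptions above.

```r
# Toy stand-ins for two raw medication tables (all values are invented).
mgb <- data.frame(
  EMPI       = c("A1", "A1", "A2"),
  Medication = c("Lipitor 20 MG", "lipitor 20 mg ", "Metformin"),
  stringsAsFactors = FALSE
)
ppmi <- data.frame(
  PATNO = c(3001, 3002),
  CMTRT = c("METFORMIN", "Aspirin"),
  stringsAsFactors = FALSE
)

# Assumed normalization: trim surrounding whitespace and lowercase.
normalize_med <- function(x) tolower(trimws(x))

# Harmonize both tables to a shared layout: id, cohort, medication string.
all_meds <- rbind(
  data.frame(id = mgb$EMPI, cohort = "MGB",
             CMTRT = normalize_med(mgb$Medication),
             stringsAsFactors = FALSE),
  data.frame(id = as.character(ppmi$PATNO), cohort = "PPMI",
             CMTRT = normalize_med(ppmi$CMTRT),
             stringsAsFactors = FALSE)
)

# Unique medication strings that would be passed on for dictionary mapping.
unique_meds <- sort(unique(all_meds$CMTRT))
print(unique_meds)
```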

b. Drug_dictionary.R
Prepares the drug dictionaries (MGB Biobank and AMP-PD) that map medication records to generic drug names using the PubChem API. Medications that cannot be matched automatically are added manually based on the DrugBank database. The file (iiii) is generated by this script.
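
Once a dictionary exists, applying it to the medication records amounts to a lookup, with unmatched strings set aside for manual curation. The sketch below illustrates only that lookup; the PubChem query itself is not shown, the rows are invented, and the variable names are hypothetical. The CMTRT and DrugName columns follow the layout of file (iiii).

```r
# Toy dictionary in the layout of drug_dictionary_* (invented rows).
dictionary <- data.frame(
  CMTRT    = c("lipitor 20 mg", "metformin", "asa 81"),
  DrugName = c("atorvastatin", "metformin", "aspirin"),
  stringsAsFactors = FALSE
)

# Toy medication records (invented rows).
records <- data.frame(
  PATNO = c(3001, 3001, 3002, 3003),
  CMTRT = c("lipitor 20 mg", "asa 81", "metformin", "vitamin x blend"),
  stringsAsFactors = FALSE
)

# match() returns NA where a medication string has no dictionary entry.
records$DrugName <- dictionary$DrugName[match(records$CMTRT, dictionary$CMTRT)]

# Strings without a generic name, to be resolved manually.
unmatched <- unique(records$CMTRT[is.na(records$DrugName)])

print(records)
print(unmatched)
```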

2. Data analysis of associations between use of drugs and PD
We first implement a large-scale drug repurposing analysis using logistic regression to estimate the associations between drug exposure and the subsequent risk of PD. Two final lists of replicated drugs are then obtained, defined as drugs identified in both the discovery and replication cohorts, using two replication methods (replication by drug name and replication by ATC code). Sensitivity analyses address dose effects and reverse causality. We also analyze longitudinal changes in cognitive and motor function using linear mixed-effects models. Of the files listed under "Data specific information for other files" below, (i)-(iiiii) are inputs and (iiiiii)-(iiiiiiiii) are outputs.

a. PD_risk.R
This script contains the main analysis procedure for PD risk: propensity score matching (PSM), plotting of the PSM results, and logistic regression.
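
As an illustration of the logistic regression step for a single drug, the sketch below fits a model on simulated data and assembles one row in the layout of the PD_risk_results_* files. It is a sketch under assumptions, not the code in PD_risk.R: the matching step is omitted, the data are simulated, the covariates (age, sex), the exposure variable on_drug, and the use of a Wald confidence interval are all hypothetical choices.

```r
set.seed(1)

# Simulated, already-matched cohort (invented data): one row per participant.
n <- 400
cohort <- data.frame(
  PD  = rbinom(n, 1, 0.5),                       # 1 = PD case, 0 = control
  age = round(rnorm(n, 65, 8)),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
# Exposure is simulated to be less common among cases than controls.
cohort$on_drug <- rbinom(n, 1, ifelse(cohort$PD == 1, 0.20, 0.35))

# Logistic regression of PD status on drug exposure, adjusted for age and sex.
fit <- glm(PD ~ on_drug + age + sex, data = cohort, family = binomial())
est <- coef(summary(fit))["on_drug", ]

# Odds ratio with a Wald 95% confidence interval (assumed CI method),
# plus the four exposure-by-status counts.
z <- qnorm(0.975)
result <- data.frame(
  DrugName = "example_drug",
  OR       = exp(est[["Estimate"]]),
  CI_lower = exp(est[["Estimate"]] - z * est[["Std. Error"]]),
  CI_upper = exp(est[["Estimate"]] + z * est[["Std. Error"]]),
  p_value  = est[["Pr(>|z|)"]],
  Y_case   = sum(cohort$PD == 1 & cohort$on_drug == 1),
  Y_ctrl   = sum(cohort$PD == 0 & cohort$on_drug == 1),
  N_case   = sum(cohort$PD == 1 & cohort$on_drug == 0),
  N_ctrl   = sum(cohort$PD == 0 & cohort$on_drug == 0)
)
print(result)
```

In a full run, a model of this kind would be fitted once per drug and the rows stacked into one results table.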

b. Replication.R
This script identifies drugs replicated between the discovery cohort and the replication cohort using two methods: replicate_by_drug_names and replicate_by_ATC_codes.
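
The two replication methods can be pictured as two different join keys between the cohort-level results. The sketch below is illustrative only: the numbers and ATC codes are invented placeholders, the helper is_replicated() is hypothetical, and the criterion it encodes (p < 0.05 in both cohorts with the same direction of effect) is an assumption rather than the rule used in Replication.R.

```r
# Toy results in the layout of PD_risk_results_*.xlsx (invented numbers).
mgb <- data.frame(DrugName = c("drug_a", "drug_b", "drug_c"),
                  OR = c(0.70, 1.40, 0.85),
                  p_value = c(0.001, 0.020, 0.030),
                  stringsAsFactors = FALSE)
amppd <- data.frame(DrugName = c("drug_a", "drug_b", "drug_d"),
                    OR = c(0.65, 0.90, 0.60),
                    p_value = c(0.010, 0.040, 0.030),
                    stringsAsFactors = FALSE)

# Toy DrugName -> ATC.code maps in the layout of *_with_ATC_update.xlsx.
# The codes are invented placeholders, not real ATC codes.
mgb_atc <- data.frame(DrugName = c("drug_a", "drug_b", "drug_c"),
                      ATC.code = c("X01AA01", "X02BB02", "X03CC03"),
                      stringsAsFactors = FALSE)
amppd_atc <- data.frame(DrugName = c("drug_a", "drug_b", "drug_d"),
                        ATC.code = c("X01AA01", "X02BB02", "X03CC03"),
                        stringsAsFactors = FALSE)

# Assumed criterion: significant in both cohorts, same direction of effect.
is_replicated <- function(d, alpha = 0.05) {
  d$p_value_MGB < alpha & d$p_value_AMPPD < alpha &
    ((d$OR_MGB < 1) == (d$OR_AMPPD < 1))
}

# Method 1: join the two result tables on the generic drug name.
by_name <- merge(mgb, amppd, by = "DrugName", suffixes = c("_MGB", "_AMPPD"))
by_name <- by_name[is_replicated(by_name), ]

# Method 2: attach ATC codes to each table, then join on the code, so that
# differently named entries sharing a code can still be paired.
by_atc <- merge(merge(mgb, mgb_atc, by = "DrugName"),
                merge(amppd, amppd_atc, by = "DrugName"),
                by = "ATC.code", suffixes = c("_MGB", "_AMPPD"))
by_atc <- by_atc[is_replicated(by_atc), ]

print(by_name$DrugName)
print(by_atc[, c("ATC.code", "DrugName_MGB", "DrugName_AMPPD")])
```

In this toy input, drug_b is significant in both cohorts but in opposite directions, so it is excluded by both methods, while drug_c and drug_d are paired only through their shared placeholder code.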

c. Sensitivity_analysis.R
This script implements the sensitivity analyses for dose effects (drug exposure definition) and reverse causality (time window prior to the index date).
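
Both sensitivity analyses can be viewed as changes to how "exposed" is defined. The sketch below shows one possible parameterization; it is not the definition used in Sensitivity_analysis.R. The function define_exposure(), the use of a minimum record count as a stand-in for dose, the lag expressed in days, and all data are assumptions made for illustration.

```r
# Toy medication records for one drug, and index dates (all invented).
records <- data.frame(
  id   = c("p1", "p1", "p1", "p2", "p3", "p3"),
  date = as.Date(c("2010-01-10", "2011-02-01", "2014-06-01",
                   "2015-05-20", "2015-03-03", "2015-09-09")),
  stringsAsFactors = FALSE
)
index_dates <- data.frame(
  id         = c("p1", "p2", "p3", "p4"),
  index_date = as.Date(c("2015-01-01", "2016-01-01",
                         "2016-01-01", "2016-01-01")),
  stringsAsFactors = FALSE
)

# A participant counts as exposed when at least `min_records` records are
# dated at least `lag_days` days before the index date.
define_exposure <- function(records, index_dates, min_records = 1, lag_days = 0) {
  m <- merge(records, index_dates, by = "id")
  eligible <- m[m$date <= m$index_date - lag_days, ]
  # Count eligible records per participant, keeping participants with none.
  counts <- table(factor(eligible$id, levels = index_dates$id))
  data.frame(id = index_dates$id,
             exposed = as.integer(as.vector(counts) >= min_records),
             stringsAsFactors = FALSE)
}

main   <- define_exposure(records, index_dates)                   # any record before index
dose   <- define_exposure(records, index_dates, min_records = 2)  # stricter exposure definition
lagged <- define_exposure(records, index_dates, lag_days = 365)   # ignore the year before index

print(cbind(main, dose = dose$exposed, lagged = lagged$exposed))
```

Excluding records close to the index date is a common way to reduce the chance that a drug prescribed for early, undiagnosed symptoms is counted as a prior exposure.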

d. Characteristic.R
This is the script to calculate the demographic characteristics of the discovery cohort and the replication cohort.

e. PD_progression.R
This script contains the main analysis procedure for PD progression: linear mixed-effects regression models and plotting.
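
A linear mixed-effects model for a longitudinal score could be specified as in the sketch below. This is an illustration on simulated data, not the model in PD_progression.R: the formula, the random intercept per participant, the exposure variable on_drug, and the effect sizes are assumptions. The column names PATNO, YEAR, and moca follow the file descriptions in this readme.

```r
library(lme4)

set.seed(42)

# Simulated longitudinal data (invented): 60 participants, 5 annual visits.
n_id <- 60
visits <- expand.grid(PATNO = seq_len(n_id), YEAR = 0:4)
on_drug_by_id   <- rbinom(n_id, 1, 0.4)    # participant-level exposure
intercept_by_id <- rnorm(n_id, 0, 1.5)     # participant-level baseline shift
visits$on_drug  <- on_drug_by_id[visits$PATNO]

# Simulated score: declines 0.5 points per year, 0.2 points per year
# more slowly among the exposed (toy values, not clamped to the MoCA range).
visits$moca <- 26 + intercept_by_id[visits$PATNO] -
  0.5 * visits$YEAR + 0.2 * visits$YEAR * visits$on_drug +
  rnorm(nrow(visits), 0, 1)

# Random intercept per participant; the YEAR:on_drug term estimates the
# difference in annual change between exposed and unexposed participants.
fit <- lmer(moca ~ YEAR * on_drug + (1 | PATNO), data = visits)
print(fixef(fit))
```

Loading lmerTest before fitting adds p-values for the fixed effects to the output of summary(). The same structure would apply to updrs3_score as the outcome.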

Data specific information for other files:
(i) /raw_data/MGB_Biobank: EMPI[Identifier of participants]|Date_of_Birth[Date of birth]|Age[Age of participants]|Gender_Legal_Sex[Sex of participants]|Race_Group[Race of participants]
(ii) /raw_data/PPMI: PATNO[Identifier of participants]|COHORT[Primary PD/HC group of participants]|CONCOHORT[Validated PD/HC group of participants]|subgroup[Detailed group of participants]|EVENT_ID[Date of records]|YEAR[Date of records]|age[Age at enrollment]|age_at_visit[Age at visit]|SEX[Sex of participants]|educ[Educational level of participants]|race[Race of participants]|moca[MoCA score -- cognition]|updrs3_score[UPDRS III score -- motor]|source[Source of participants]
(iii) /raw_data/PDBP: PATNO[Identifier of participants]|COHORT[Primary PD/HC group of participants]|CONCOHORT[Validated PD/HC group of participants]|subgroup[Detailed group of participants]|EVENT_ID[Date of records]|YEAR[Date of records]|age[Age at enrollment]|age_at_visit[Age at visit]|SEX[Sex of participants]|educ[Educational level of participants]|race[Race of participants]|moca[MoCA score -- cognition]|updrs3_score[UPDRS III score -- motor]|source[Source of participants]
(iiii) /raw_data/atccodes.csv: ATC.code[ATC codes of drugs]|Name[Name of drugs]
(iiiii) /raw_data/export_Touchstone.txt: Name[Name of drugs]|Target[Protein target genes of drugs]|MoA[Mechanism of action]
(iiiiii) /results/PD_risk_results_AMPPD.xlsx: DrugName[Name of drugs]|OR[Odds ratio]|CI_lower[Lower limit of 95% confidence interval]|CI_upper[Upper limit of 95% confidence interval]|p_value[p value]|Y_case[Cases ever on the drug]|Y_ctrl[Controls ever on the drug]|N_case[Cases never on the drug]|N_ctrl[Controls never on the drug]
(iiiiiii) /results/PD_risk_results_MGB.xlsx: DrugName[Name of drugs]|OR[Odds ratio]|CI_lower[Lower limit of 95% confidence interval]|CI_upper[Upper limit of 95% confidence interval]|p_value[p value]|Y_case[Cases ever on the drug]|Y_ctrl[Controls ever on the drug]|N_case[Cases never on the drug]|N_ctrl[Controls never on the drug]
(iiiiiiii) /results/AMPPD_with_ATC_update.xlsx: DrugName[Name of drugs]|ATC.code[ATC codes of drugs]
(iiiiiiiii) /results/MGB_with_ATC_update.xlsx: DrugName[Name of drugs]|ATC.code[ATC codes of drugs]

3. Data analysis of drug-target gene-cell relationship
We map drug target genes to cell types to explore potential therapeutic targets for PD. Please see the paper for more details.

a. Target_genes.R
This script implements the gene expression analysis of drug target genes in PD.
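
The mapping from drugs to target genes to cell types could be sketched as below. Everything here is hypothetical and for illustration only: the rows and expression values are invented, the comma separator inside the Target field is an assumption about export_Touchstone.txt, and assigning each gene to its highest-expressing cell type is a stand-in for the analysis in Target_genes.R, which is described in the paper.

```r
# Toy rows in the layout of export_Touchstone.txt (invented values).
# Assumption: multiple targets are comma-separated within the Target field.
touchstone <- data.frame(
  Name   = c("drug_a", "drug_b"),
  Target = c("GENE1, GENE2", "GENE2"),
  stringsAsFactors = FALSE
)

# Expand to one row per drug-gene pair.
gene_list <- lapply(strsplit(touchstone$Target, ",", fixed = TRUE), trimws)
drug_gene <- data.frame(
  Name = rep(touchstone$Name, lengths(gene_list)),
  gene = unlist(gene_list),
  stringsAsFactors = FALSE
)

# Toy mean-expression matrix, genes x cell types (invented values).
expr <- matrix(c(5.0, 0.2, 0.1,
                 1.0, 3.0, 0.5,
                 0.3, 0.4, 6.0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("GENE1", "GENE2", "GENE3"),
                               c("neuron", "astrocyte", "microglia")))

# For each gene, the cell type with the highest mean expression.
top_cell <- colnames(expr)[apply(expr, 1, which.max)]
names(top_cell) <- rownames(expr)

drug_gene$top_cell_type <- unname(top_cell[drug_gene$gene])
print(drug_gene)
```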

-----------------------
DATA ACCESS AND SHARING
-----------------------
Publications based on this dataset: pending
Recommended citation for this dataset: pending
License information: MIT license for GitHub