This file contains the data and scripts for the preprint "Large language models identify causal genes in complex trait GWAS"

Folder structure

* .env.example - Example of a .env file used to store API keys for OpenAI LLM access

* .. data
   * benchmark_datasets 
      * <dataset>_step1.tsv - Original dataset filtered by criteria described, with annotation of EFO phenotype added where necessary
      * <dataset>_step2.for_llm.tsv - Input parameters to LLM (phenotype name, list of genes within 500 kbp of index variant)
      * <dataset>_step2.labels - Ground-truth causal-gene labels
      * Provided for the following datasets:
         * Opentargets
         * Pharmaprojects
         * GWAS Catalog
         * (Not included) Weeks et al.
      * Scrambled versions of all 4 datasets
   * helper_datasets 
      * gene_embeddings.csv - 3,072-dimensional embeddings of genes used in our analysis 
      * phenotype_embeddings.csv - 3,072-dimensional embeddings of phenotypes used in our analysis 
      * gene_list - List of all genes in our analysis  
      * pheno_list  - List of phenotypes used in analysis
      * publication_count_by_gene.txt - Publication count by gene Ensembl ID
      * UKBB_94traits_release1.traits.efo_tagged - Information about the traits in the Weeks et al. data, with EFO tags added
* .. results
   * predictions 
      * <dataset>.<method>.csv - Predictions for <dataset> using <method>
         * For all 4 datasets, including Weeks et al.
         * Methods = nearest_gene, text_mining_gene, pops, L2G, gpt3_zero_shot, gpt3_zero_shot_minimal, gpt4_1106_zero_shot, gpt4_0613_zero_shot
   * others 
      * <dataset>.embedding_info.csv - Rank of causal gene from embeddings for all 4 datasets
* .. scripts
   * llm_caller
      * run_llm_v2g.py - Run LLM-based causal gene prediction
      * prompt_utils.py - Utility functions that define the LLM prompts
      * create_embedding.py - Generate embeddings for any new entities
      * utils.py - Other utility functions
   * evaluation
      * create_figures_and_tables.R - R script to recreate figures and tables
      * utils.R - File with utility functions for evaluation
   * data_generation
      * process_weeks_et_al_both_steps.R - R script to generate the step1 and step2 files in the benchmark datasets from the raw input files from Weeks et al.
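
The naming scheme in the folder structure above can be turned into file paths programmatically. The following is a minimal sketch, not part of the repository: the helper names `benchmark_paths` and `prediction_path` are hypothetical, and the exact spelling of each `<dataset>` prefix in the file names is an assumption that should be checked against the files on disk.

```python
from pathlib import Path

# Method names as listed under results/predictions in this README.
METHODS = [
    "nearest_gene", "text_mining_gene", "pops", "L2G",
    "gpt3_zero_shot", "gpt3_zero_shot_minimal",
    "gpt4_1106_zero_shot", "gpt4_0613_zero_shot",
]


def benchmark_paths(root, dataset):
    """Return the three benchmark files for one dataset (hypothetical helper)."""
    base = Path(root) / "data" / "benchmark_datasets"
    return {
        "step1": base / f"{dataset}_step1.tsv",
        "for_llm": base / f"{dataset}_step2.for_llm.tsv",
        "labels": base / f"{dataset}_step2.labels",
    }


def prediction_path(root, dataset, method):
    """Return results/predictions/<dataset>.<method>.csv (hypothetical helper)."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return Path(root) / "results" / "predictions" / f"{dataset}.{method}.csv"
```

Both helpers only build paths and do not check that the files exist.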

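The .env file mentioned at the top of the folder structure stores the API keys. As an illustration only, the sketch below reads simple KEY=VALUE lines using the Python standard library. The variable name `OPENAI_API_KEY` is an assumption (check .env.example for the actual name), and the scripts in this repository may load the file differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and return them as a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env", name="OPENAI_API_KEY"):
    """Prefer an exported environment variable, then fall back to the .env file."""
    # NOTE: the default variable name is an assumption, not taken from the repository.
    return os.environ.get(name) or load_env_file(path).get(name)
```
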

Notes
* Download UKB_AllMethods_GenePrioritization.txt from https://www.dropbox.com/sh/o6t5jprvxb8b500/AACqCux_jJbF9F56ozhzzkpia/results/UKB_AllMethods_GenePrioritization.txt.gz?dl=0 and place it in the data/helper_datasets folder
* Obtain the Weeks et al. raw data from the authors (file: PoPS_UKBB_noncoding_validation_1348CSs.txt.gz) and place it in the data/helper_datasets folder
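
Before running the scripts, it may help to confirm that both externally obtained files are in place. The check below is a rough sketch under stated assumptions: the function name `missing_helper_files` is hypothetical, and because the note names the first file as .txt while the link points to a .txt.gz, either form is accepted here.

```python
from pathlib import Path

# File names taken from the notes above; each entry maps a label
# to the on-disk names accepted for it (an assumption of this sketch).
REQUIRED_FILES = {
    "UKB_AllMethods_GenePrioritization.txt": (
        "UKB_AllMethods_GenePrioritization.txt",
        "UKB_AllMethods_GenePrioritization.txt.gz",
    ),
    "PoPS_UKBB_noncoding_validation_1348CSs.txt.gz": (
        "PoPS_UKBB_noncoding_validation_1348CSs.txt.gz",
    ),
}


def missing_helper_files(root="."):
    """Return labels of required files not found in data/helper_datasets."""
    helper_dir = Path(root) / "data" / "helper_datasets"
    return [
        label
        for label, names in REQUIRED_FILES.items()
        if not any((helper_dir / name).is_file() for name in names)
    ]


if __name__ == "__main__":
    for label in missing_helper_files():
        print(f"Missing from data/helper_datasets: {label}")
```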
