Published November 2, 2022 | Version v3
Dataset Open

The impact of genetically controlled splicing on exon inclusion and protein structure

  • 1. Columbia University
  • 2. KTH University
  • 3. New York Genome Center


This repository contains raw and processed files used in Einson et al. 2022.

Code used to generate these files can be found here:

Descriptions of the files contained within each subdirectory:


  • {GTEx_tissue_id}_v8.psi.tsv.gz: Unfiltered PSI output from IPSA-nf, per tissue. See methods for details about how files were created. 
  • gtex_v8_exon_id_map.tsv: Mapping file between exon coordinates and Ensembl gene IDs, with suffix used in GTEx v8 gencode annotation. 


  • cross_tissue
    • top_sQTLs_MAF05.tsv: List of top GTEx v8 sQTLs across tissues, with one exon and top variant per tissue. See methods for details. See matching file for column descriptions. 
    • top_sQTLs_median_psi.tsv: The median, mean, and standard deviation of PSI of each significant exon from the previous file, taken across all individuals from GTEx with data available.
    • top_sQTLs_MAF05_w_anc_allele.tsv: List of top sQTLs across tissues, with additional columns for the top ψQTL ancestral and derived alleles, where available. 
  • per_tissue
    • {GTEx_tissue_id}_combined_sQTLs.tsv.gz: Raw output of ψQTL calling using QTLtools in grouped permutational mode per tissue, with groups specified by gene. See methods for more details, and for column descriptions. 
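
The per-exon summaries in top_sQTLs_median_psi.tsv (median, mean, and standard deviation of PSI across individuals) can be illustrated with a minimal sketch. It assumes PSI values for one exon are already in memory as a list, with missing individuals stored as None. The function name, the input layout, and the choice of the sample standard deviation are assumptions made for illustration, not a description of the authors' code.

```python
import statistics


def summarize_psi(values):
    """Return (median, mean, sd) of the non-missing PSI values for one exon.

    `values` is a hypothetical list with one entry per individual;
    None marks an individual without data for this exon.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return None  # no individual has data for this exon
    # Sample standard deviation (n - 1 denominator) is an assumption;
    # it is undefined for a single observation, so report None there.
    sd = statistics.stdev(observed) if len(observed) > 1 else None
    return statistics.median(observed), statistics.mean(observed), sd


# Toy values, not taken from the dataset.
example = [0.25, None, 0.5, 0.75]
median_psi, mean_psi, sd_psi = summarize_psi(example)
```

Individuals without data are skipped rather than counted as zero, which matches the description "all individuals from GTEx with data available".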


  • GTEx_psi_{GTEx_tissue_id}.collapsed.txt.gz: Output of the QTL catalog fine mapping pipeline, run on all exons and tissues, and collapsed using the procedure described in Methods. 


  • combined_coloc_results_full.tsv.gz: Combined output of running coloc on ψQTLs from the 18 GTEx tissues against 87 sets of GWAS summary statistics. This file contains all results, including non-significant associations. A nominal QTLtools pass was used as input. The nominal QTL calls are not included in this repository due to size limitations; contact the authors if you need access to them. 
  • top_sQTLs_with_top_coloc_event.tsv: The QTLs in top_sQTLs_MAF05.tsv with additional columns for the GWAS with the highest posterior probability of a colocalization event. Importantly, the tissue and top variant may not match the main top_sQTLs_MAF05.tsv file for every gene. 

05_exon_features: See matching files for description of each column. 

  • cross_tissue_constitutive_exons_with_AF.tsv: Detailed features of cross tissue constitutive exons. See methods for definition of constitutive exons. 
  • cross_tissue_nonsignificant_genes_with_AF.tsv: Detailed features of sufficiently variable exons with no significant variant across tissues. See methods for more details. 
  • top_sQTLs_MAF05_with_AF.tsv: Detailed features of top sQTLs. 
  • top_sQTLs_with_top_coloc_with_AF.tsv: Detailed features of sQTLs that colocalize with at least one GWAS trait. Contains columns for Euclidean distances between PAE matrices and RMSD between isoforms, among genes with a significant GWAS colocalization event. 
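
As a rough illustration of a Euclidean distance between PAE matrices, the sketch below treats each matrix as a flat vector and takes the square root of the summed squared differences. It assumes the two matrices have already been brought to the same shape; how isoforms of different lengths were handled is not described here, so this is not necessarily the procedure used for the reported columns. The function name is hypothetical.

```python
import math


def pae_euclidean_distance(pae_a, pae_b):
    """Euclidean (Frobenius) distance between two equally shaped PAE matrices.

    Each matrix is a list of lists of floats, as stored in the JSON outputs.
    """
    same_rows = len(pae_a) == len(pae_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(pae_a, pae_b)):
        raise ValueError("PAE matrices must have the same shape")
    squared = sum(
        (a - b) ** 2
        for row_a, row_b in zip(pae_a, pae_b)
        for a, b in zip(row_a, row_b)
    )
    return math.sqrt(squared)


# Toy 2x2 matrices, not taken from the dataset.
distance = pae_euclidean_distance([[0.0, 3.0], [4.0, 0.0]],
                                  [[0.0, 0.0], [0.0, 0.0]])
```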

06_predicted_structures: Each prediction was run 5 times, and we report the best model in the manuscript. 

  • {}[_mutant].result
    • {}[_mutant]{}_coverage.png.gz: Plot of the number of sequences per position in MSA
    • {}[_mutant]{}_PAE.png.gz: PAE matrix plots for each model
    • {}[_mutant]{}_plddt.png.gz: pLDDT plots for each model
    • {}[_mutant]{}_predicted_aligned_error_v1.json.gz: A PAE matrix for the best model using AlphaFold-DB's format
    • {}[_mutant]{}_unrelaxed_rank_{rank.num}_model_{model.num}_scores.json.gz: Per model array (list of lists) with PAE, a list of the average pLDDT and the pTM score. 
    • {}[_mutant]{}_unrelaxed_rank_{rank.num}_model_{model.num}_pdb.gz: Per model predicted structure in PDB format
    • {}[_mutant]{}.a3m.gz: A3M formatted input MSA
    • cite.bibtex: BibTeX file with citations for all used tools and databases
    • config.json: Model input parameters
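
The per-model scores files can be read with the Python standard library. The sketch below assumes the gzipped JSON is a single object with keys named "pae", "plddt" and "ptm"; these key names are an assumption based on the description above and should be checked against an actual file. The function names are hypothetical.

```python
import gzip
import json


def load_scores(path):
    """Read one gzipped *_scores.json.gz file into a Python dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize_scores(scores):
    """Collect a few headline numbers from one model's scores.

    Assumes keys "pae" (list of lists), "plddt" (list) and "ptm" (float).
    """
    plddt = scores["plddt"]
    return {
        "mean_plddt": sum(plddt) / len(plddt),
        "ptm": scores["ptm"],
        "pae_rows": len(scores["pae"]),
    }
```

Calling summarize_scores(load_scores(path)) on each model's file gives a quick way to compare the five predictions per structure.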


  • cross_tissue_constitutive_exons.tsv: List of exons that are constitutively spliced across multiple tissues. See methods for details. 
  • cross_tissue_nonsignificant_genes.tsv: List of variably spliced exons with no significant sVariant in any tissue. See methods for details. 
  • gtex_v8_exon_id_map.rds: rds representation of a map between exon IDs, as used in the modified version of gencode v26, and exon hg38 coordinates. 
  • gtex_v8_n_exons_per_gene.tsv: Number of exons per gene, as annotated in the modified version of gencode v26 used in GTEx v8. 


  • geuvadis_psi.tsv.gz: Unfiltered PSI output from IPSA-nf, run on Geuvadis BAM files. See methods for details. (Raw data was downloaded from
  • geuvadis_sQTLs.tsv.gz: Raw output of ψQTL calling using QTLtools in grouped permutational mode for Geuvadis data, with groups specified by gene. See methods for more details, and for column descriptions. 
  • remapped_gencode.v26.GRCh37.GTEx_v8.nochr.genes.gtf.gz: Lifted over version of the gencode v26 gtf file, used to define exons for PSI and QTL mapping in the Geuvadis analysis. The original version that was used in the GTEx analysis is based on GRCh38, and is available here:


