Published May 14, 2024 | Version v1
Dataset Open

Paired datasets to study alternative splicing regulation by individual RNA-binding proteins

  • 1. LASIGE
  • 2. Instituto de Medicina Molecular João Lobo Antunes

Description

This project stores datasets generated to study the regulation of alternative splicing using deep learning models (e.g., SpliceAI). In particular, these datasets were used to perform ablation studies (sequence perturbations at motif locations) to evaluate their effects on the deep learning model.

I used public RNA-Seq data from the ENCODE consortium to identify exons sensitive to the knockdown of RNA-binding proteins (RBPs). The idea is that exons sensitive to RBP knockdowns are more likely to be directly or indirectly regulated by those RBPs, and so provide hints about their regulatory mechanisms. Importantly, I also generated paired control exons, which were not alternatively spliced upon RBP knockdown but have GC composition and length similar to those of the knockdown-sensitive exons (target exon and surrounding introns). These control sets were generated to account for potential confounding by gene architecture features, so that analyses can focus on RBP binding motifs and their regulatory logic.
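
To make the idea of a GC- and length-matched control concrete, here is a minimal sketch of one generic way to pick a control sequence: nearest-neighbour matching on GC fraction and relative length. This is an illustration only and is not necessarily the procedure used to build these datasets. The function names and the distance definition are assumptions, and the sketch matches a single sequence rather than the full exon triplet.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def pick_control(target, candidates):
    """Return the candidate closest to `target` in GC fraction and length.

    Hypothetical distance: absolute GC difference plus the length
    difference relative to the target length. `target` must be non-empty.
    """
    target_gc = gc_content(target)
    target_len = len(target)

    def distance(seq):
        gc_diff = abs(gc_content(seq) - target_gc)
        len_diff = abs(len(seq) - target_len) / target_len
        return gc_diff + len_diff

    return min(candidates, key=distance)
```

In practice the candidate pool would be restricted to non-changing exons, and each control would typically be used only once.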

Information about the files

After uncompressing the 'paired_dataset.tar.gz' file, a directory with multiple files will be created with the following structure:

  • 0_rMATS_ES_events.tsv.gz: Summary tables of differential splicing analysis, with deltaPSI estimates referring to Ctrl - Knockdown groups. Important columns: 'target_coordinates' refers to the 1-based coordinates of the alternatively spliced exon, and 'group' indicates the individual knockdown experiments where the exon was observed to be alternatively spliced.
  • 0_rMATs_ES_non_changing_events.tsv.gz: Summary tables of differential splicing analysis, restricted to non-changing events (|dPSI| < 0.025).
  • 1_KD_exons_dPSI0.1.tsv.gz: Table with knockdown-sensitive exons along with values for gene architecture features along the exon triplet (upstream exon, upstream intron, cassette exon, downstream intron, downstream exon).

  • 1_Ctrl_exons_dPSI0.025.tsv.gz: Same as '1_KD_exons_dPSI0.1.tsv.gz', but for all non-changing events.

  • 2_paired_datasets.tsv.gz: Paired datasets in tidy format, where Knockdown-sensitive exons and their Control pairs come in consecutive lines. The 'rbp_name' column refers to the individual knockdown experiment where that exon was observed.

  • 2_paired_datasets_negative_dPSI.tsv.gz, 2_paired_datasets_positive_dPSI.tsv.gz: Same as '2_paired_datasets.tsv.gz', but knockdown-sensitive exons are split according to the direction of dPSI observed in the RNA-Seq data (along with the respective control pair).
  • 2_paired_datasets_individualRBPs: This folder contains the paired datasets in wide format, where a single line contains both the knockdown-sensitive and control pair. In addition, each paired dataset (knockdown of individual RBP) is written in a separate file.
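
Because the tidy paired table stores each knockdown-sensitive exon and its control on consecutive lines, the pairs can be recovered by reading rows two at a time. The sketch below uses only the Python standard library. It assumes a tab-separated file with a header line and that the knockdown-sensitive exon comes first in each pair; both points should be checked against the actual file. The 'exon' column in the demo is hypothetical.

```python
import csv
import gzip
import io


def read_pairs(handle):
    """Return (knockdown_row, control_row) tuples from a tidy paired table.

    Assumption: rows alternate knockdown-sensitive exon, then its control.
    """
    rows = list(csv.DictReader(handle, delimiter="\t"))
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows (exon/control pairs)")
    return [(rows[i], rows[i + 1]) for i in range(0, len(rows), 2)]


def read_pairs_from_gz(path):
    """Read pairs from a gzip-compressed TSV such as 2_paired_datasets.tsv.gz."""
    with gzip.open(path, "rt", newline="") as handle:
        return read_pairs(handle)


if __name__ == "__main__":
    # Tiny synthetic table; the 'exon' column is a made-up placeholder.
    demo = "rbp_name\texon\nRBP_A\tkd1\nRBP_A\tctrl1\n"
    for kd, ctrl in read_pairs(io.StringIO(demo)):
        print(kd["rbp_name"], kd["exon"], ctrl["exon"])
```

With the real archive extracted, `read_pairs_from_gz` can be pointed at the tidy table, and the pairs can then be grouped by the 'rbp_name' column.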

Details of the shRNA knockdown RNA-Seq analysis

In the ENCODE study (Van Nostrand E.L. et al., 2020), the authors analyzed knockdown RNA-Seq data using an older version of the human genome (hg19) and older genome annotations (GENCODE v19). For this reason, I reanalyzed the ENCODE data aligned to the hg38 genome build. I ran rMATS v4.1.2 on each RBP knockdown experiment to detect differentially spliced events between the two knockdown replicates and the two control replicates. rMATS was run with GENCODE v44 annotations and with the non-default setting --cstat 0.05.
 
Significant knockdown-sensitive events were defined as those with |deltaPSI| > 0.1 at a False Discovery Rate cutoff of 0.05. Non-changing events, used as knockdown-agnostic controls, were defined as those with negligible deltaPSI variation (|deltaPSI| < 0.025). To ensure the quality of the exon sets, I applied further filtering steps. First, I applied a read coverage filter, retaining events where the median coverage across replicates, per condition, for the isoform with more read counts was higher than 7. Then, I focused exclusively on exon skipping events in protein-coding genes, and filtered out unannotated exons (pseudoexons) as well as the first and last exons of genes. In addition, I removed duplicate exon skipping events by keeping the transcript with the highest biological importance (based on the presence of transcript flags such as MANE Select, CCDS, or APPRIS). A total of 15,235 events were detected across all RBP knockdown experiments (N=72, splicing-associated RBPs with data available for the HepG2 cell line), covering 6,659 unique exons.
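
The thresholds above can be written down as a small classification rule. The following is a minimal sketch, not the code used to produce the files: the function names are hypothetical, the strictness of the FDR comparison (< rather than <=) is an assumption, and the coverage check reflects one reading of the filter (choose the isoform with more total reads in a condition, then require its median across replicates to exceed 7).

```python
from statistics import median

DPSI_SENSITIVE = 0.1       # |deltaPSI| above this: knockdown-sensitive
DPSI_NON_CHANGING = 0.025  # |deltaPSI| below this: non-changing control
FDR_CUTOFF = 0.05
MIN_MEDIAN_COVERAGE = 7


def classify_event(dpsi, fdr):
    """Label an event from its deltaPSI and FDR, or return None."""
    if abs(dpsi) > DPSI_SENSITIVE and fdr < FDR_CUTOFF:
        return "knockdown_sensitive"
    if abs(dpsi) < DPSI_NON_CHANGING:
        return "non_changing"
    return None  # intermediate events belong to neither set


def passes_coverage(inclusion_counts, skipping_counts):
    """Coverage check for one condition, given per-replicate read counts.

    Assumed reading: take the isoform with more total reads and require
    its median count across replicates to be higher than 7.
    """
    if sum(inclusion_counts) >= sum(skipping_counts):
        dominant = inclusion_counts
    else:
        dominant = skipping_counts
    return median(dominant) > MIN_MEDIAN_COVERAGE
```

Under this reading, an event would be kept only if `passes_coverage` holds for both the control and the knockdown condition.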
 

Files

Files (55.6 MB)

md5:446f154c0aeae986e44fa68a47205974 (55.6 MB)

Additional details