Paired datasets to study alternative splicing regulation by individual RNA-binding proteins
Creators
- 1. LASIGE
- 2. Instituto de Medicina Molecular João Lobo Antunes
Description
This project stores datasets generated to study the regulation of alternative splicing using deep learning models (e.g., SpliceAI). In particular, these datasets were used to perform ablation studies (sequence perturbations at motif locations) and to evaluate the effect of those perturbations on the model's predictions.
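To make the idea of a motif ablation concrete, here is a minimal sketch, assuming that a perturbation simply masks every occurrence of a motif with a filler base. The function name `ablate_motif`, the toy sequence, and the motif are hypothetical and are not taken from this record or from the mutsplice code; the actual perturbation strategy may differ.

```python
import re
from typing import List, Tuple


def ablate_motif(sequence: str, motif: str, fill: str = "N") -> Tuple[str, List[int]]:
    """Mask every (possibly overlapping) occurrence of `motif` in `sequence`.

    Returns the perturbed sequence and the 0-based start positions of the hits.
    This is an illustrative masking scheme, not the authors' method.
    """
    # A zero-width lookahead lets the regex report overlapping matches.
    pattern = re.compile(f"(?={re.escape(motif)})")
    starts = [m.start() for m in pattern.finditer(sequence)]

    chars = list(sequence)
    for start in starts:
        for i in range(start, start + len(motif)):
            chars[i] = fill
    return "".join(chars), starts


# Toy example: two overlapping motif hits in a short made-up sequence.
perturbed, hits = ablate_motif("AAGCATGCATGTT", "GCATG")
print(perturbed, hits)
```

The original and perturbed sequences could then both be scored by the model, and the difference in predictions attributed to the masked motif.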
I used public RNA-Seq data from the ENCODE consortium to identify exons sensitive to the knockdown of RNA-binding proteins (RBPs). The idea is that exons sensitive to RBP knockdowns are more likely to be directly or indirectly regulated by such RBPs, and can therefore provide hints about their regulatory mechanisms. Importantly, I also generated paired control exons, which were not alternatively spliced upon RBP knockdown but have GC composition and length (of the target exon and surrounding introns) similar to those of the knockdown-sensitive exons. These control sets were generated to account for the potential confounding effects of gene architecture features and, therefore, to focus only on RBP binding motifs and their regulatory logic.
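One way such a pairing could be done is sketched below: greedy nearest-neighbour matching on standardized GC content and length, without reusing a control exon. This is an assumption for illustration only; the procedure actually used is in the notebook listed under "Related works", and the column names `gc_content` and `length` are hypothetical.

```python
import pandas as pd


def pair_controls(kd: pd.DataFrame, ctrl: pd.DataFrame,
                  features=("gc_content", "length")) -> pd.DataFrame:
    """Greedily pair each knockdown-sensitive exon with its closest unused control.

    Distances are squared Euclidean distances on z-scored features.
    Illustrative sketch, not the authors' matching procedure.
    """
    cols = list(features)
    # Standardize with pooled statistics so GC content and length are comparable.
    pooled = pd.concat([kd[cols], ctrl[cols]])
    mean = pooled.mean()
    std = pooled.std(ddof=0).replace(0, 1.0)
    kd_z = (kd[cols] - mean) / std
    available = (ctrl[cols] - mean) / std

    pairs = []
    for kd_id, row in kd_z.iterrows():
        if available.empty:
            break  # no controls left to assign
        dist = ((available - row) ** 2).sum(axis=1)
        best = dist.idxmin()
        pairs.append((kd_id, best))
        available = available.drop(index=best)  # each control is used once
    return pd.DataFrame(pairs, columns=["kd_exon", "ctrl_exon"])


# Toy data with made-up exon identifiers and feature values.
kd = pd.DataFrame({"gc_content": [0.40, 0.60], "length": [100, 150]},
                  index=["kd1", "kd2"])
ctrl = pd.DataFrame({"gc_content": [0.61, 0.39, 0.50], "length": [148, 102, 300]},
                    index=["c1", "c2", "c3"])
pairs = pair_controls(kd, ctrl)
print(pairs)
```

Greedy matching depends on the order of the knockdown-sensitive exons; an optimal assignment would be an alternative when that matters.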
Information about the files
After uncompressing the 'paired_dataset.tar.gz' file, a directory with multiple files will be created with the following structure:
- 0_rMATS_ES_events.tsv.gz: Summary tables of differential splicing analysis, with deltaPSI (dPSI) estimates given as Ctrl - Knockdown. Important columns: 'target_coordinates' holds the 1-based coordinates of the alternatively spliced exon, and 'group' indicates the individual knockdown experiments where the exon was observed to be alternatively spliced.
- 0_rMATs_ES_non_changing_events.tsv.gz: Summary tables of differential splicing analysis, but containing only non-changing events (|dPSI| < 0.025).
- 1_KD_exons_dPSI0.1.tsv.gz: Table with knockdown-sensitive exons along with values for gene architecture features along the exon triplet (exon upstream, intron upstream, cassette exon, intron downstream, exon downstream).
- 1_Ctrl_exons_dPSI0.025.tsv.gz: Same as '1_KD_exons_dPSI0.1.tsv.gz', but for all non-changing events.
- 2_paired_datasets.tsv.gz: Paired datasets in tidy format, where Knockdown-sensitive exons and their Control pairs come in consecutive lines. The 'rbp_name' column refers to the individual knockdown experiment where that exon was observed.
- 2_paired_datasets_negative_dPSI.tsv.gz, 2_paired_datasets_positive_dPSI.tsv.gz: Same as '2_paired_datasets.tsv.gz', but knockdown-sensitive exons are split according to the direction of dPSI observed in the RNA-Seq data (along with the respective control pair).
- 2_paired_datasets_individualRBPs: This folder contains the paired datasets in wide format, where a single line contains both the knockdown-sensitive exon and its control pair. In addition, each paired dataset (knockdown of an individual RBP) is written to a separate file.
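The tables above are gzip-compressed TSV files, so they can be read directly with pandas. The sketch below shows one possible way to unpack the archive, load a table, and apply the dPSI thresholds suggested by the file names (0.1 and 0.025). The dPSI column name `dPSI`, the destination directory, and the toy rows are assumptions; check the actual headers after loading.

```python
import tarfile

import pandas as pd


def extract_archive(archive: str = "paired_dataset.tar.gz", dest: str = ".") -> None:
    """Unpack the archive; `dest` is a hypothetical destination directory."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=dest)


def load_table(path: str) -> pd.DataFrame:
    """Read a tab-separated table; gzip is inferred from the .gz extension."""
    return pd.read_csv(path, sep="\t", compression="infer")


def non_changing(events: pd.DataFrame, dpsi_col: str = "dPSI",
                 threshold: float = 0.025) -> pd.DataFrame:
    """Keep events with |dPSI| below the threshold."""
    return events[events[dpsi_col].abs() < threshold]


def sensitive(events: pd.DataFrame, dpsi_col: str = "dPSI",
              threshold: float = 0.1) -> pd.DataFrame:
    """Keep events with |dPSI| at or above the threshold."""
    return events[events[dpsi_col].abs() >= threshold]


def split_by_direction(events: pd.DataFrame, dpsi_col: str = "dPSI"):
    """Split events into (negative dPSI, positive dPSI) tables."""
    return events[events[dpsi_col] < 0], events[events[dpsi_col] > 0]


# Toy table standing in for a loaded file; coordinates and groups are made up.
toy = pd.DataFrame({
    "target_coordinates": ["chr1:100-200", "chr2:300-420", "chr3:50-140", "chr4:10-90"],
    "group": ["RBP_A", "RBP_A", "RBP_B", "RBP_B"],
    "dPSI": [0.30, -0.15, 0.01, -0.02],
})
unchanged = non_changing(toy)
negative, positive = split_by_direction(sensitive(toy))
print(len(unchanged), len(negative), len(positive))
```

With the real data, `load_table` would be pointed at one of the extracted files, for example `2_paired_datasets.tsv.gz`, and the result grouped by the 'rbp_name' column to work on one knockdown experiment at a time.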
Files
Size | Checksum |
---|---|
55.6 MB | md5:446f154c0aeae986e44fa68a47205974 |
Additional details
Related works
- Is derived from
- Dataset: https://github.com/PedroBarbosa/mutsplice/blob/main/notebooks/1_build_paired_datasets.ipynb (URL)
Software
- Repository URL
- https://github.com/PedroBarbosa/mutsplice/