1 Reading the data
The input files are:
- the output tables made by featureCounts (run in rule 'count_on_TE' of the Snakefile), which contain read counts of all repetitive elements (i.e. all repNames) for all samples
- the repetitive elements annotation ../../data/annotations/RepeatMasker_RepeatLibrary20140131_mm10.noGenes.noSimple.bed (converted to SAF format in rule of Snakefile) used for featureCounts command, that is used to retrieve repFamily and repClass for each repName. The annotation I chose is the most updated library (to date) of rmsk, from which I removed 'Simple repeats' and 'Low complexity regions'
- a table with total number of reads per sample in raw fastq files
- the samples' table, containing infos on experimental design:
SRR | sample | condition | group_or_time_point | biol_rep | tech_rep | run | library_layout | read_length |
---|---|---|---|---|---|---|---|---|
none | naives_345 | WT | 1 | 1 | 1 | 1 | PAIRED - | 101 |
none | naives_266 | WT | 1 | 2 | 1 | 1 | PAIRED - | 101 |
none | naives_344 | WT | 1 | 3 | 1 | 1 | PAIRED - | 101 |
none | naives_325 | WT | 1 | 4 | 1 | 1 | PAIRED - | 101 |
none | naives_440 | KO | 1 | 1 | 1 | 1 | PAIRED - | 101 |
none | naives_278 | KO | 1 | 2 | 1 | 1 | PAIRED - | 101 |
none | naives_287 | KO | 1 | 3 | 1 | 1 | PAIRED - | 101 |
none | naives_340 | KO | 1 | 4 | 1 | 1 | PAIRED - | 101 |
none | J7_300 | WT | 2 | 1 | 1 | 1 | PAIRED - | 101 |
none | J7_305 | WT | 2 | 2 | 1 | 1 | PAIRED - | 101 |
none | J7_189 | WT | 2 | 3 | 1 | 1 | PAIRED - | 101 |
none | J7_290 | WT | 2 | 4 | 1 | 1 | PAIRED - | 101 |
none | J7_400 | KO | 2 | 1 | 1 | 1 | PAIRED - | 101 |
none | J7_187 | KO | 2 | 2 | 1 | 1 | PAIRED - | 101 |
none | J7_291 | KO | 2 | 3 | 1 | 1 | PAIRED - | 101 |
none | J7_399 | KO | 2 | 4 | 1 | 1 | PAIRED - | 101 |
none | J14_193 | WT | 3 | 1 | 1 | 1 | PAIRED - | 101 |
none | J14_307 | WT | 3 | 2 | 1 | 1 | PAIRED - | 101 |
none | J14_350 | WT | 3 | 3 | 1 | 1 | PAIRED - | 101 |
none | J14_295 | WT | 3 | 4 | 1 | 1 | PAIRED - | 101 |
none | J14_351 | KO | 3 | 1 | 1 | 1 | PAIRED - | 101 |
none | J14_186 | KO | 3 | 2 | 1 | 1 | PAIRED - | 101 |
none | J14_192 | KO | 3 | 3 | 1 | 1 | PAIRED - | 101 |
none | J14_394 | KO | 3 | 4 | 1 | 1 | PAIRED - | 101 |
2 Analysis at family level
2.1 FPKM for each family of Repetitive Elements
For each family of Repetitive Elements (in case of elements with no repFamily name or repFamilies belonging to more than one repClass I use repClass) I compute FPKM values, as follows: for each sample:
- I compute the sum of counts for all elements belonging to that repFamily
- I divide this sum by the total number of reads sequenced for that sample and multiply by 10⁶
- I divide this number by the total sum of lengths (in Kb) of the elements belonging to that repFamily --> FPKM
- I subtract from each FPKM the total FPKM of all transposons belonging to the DNA repClass
2.2 Heatmaps
The heatmaps are scaled by rows.
It is evident that the correction for DNA trasposons is necessary to avoid that some replicates show random higher expression of some elements. The most affected samples by DNA contamination (and therefore by the effect of the normalization) are samples naives_344, naives_440 and J14_394. I will exclude these samples from all the analysis below.
3 DE-Seq analysis of RNA transposons
I include the FPKM of DNA transposons as confounding factor in DESeq2 formula.
Before running the Differential Expression analysis, the data are pre-filtered to remove all repetitive elements with < 10 reads among all samples.
3.1 MA-plots
- The threshold used for a dot to be coloured in the MA-plots is: p-value adjusted < 0.05 and mean expression > 20.
- Transposable elements whose log2FoldChange > 2 (or < -2) are labeled.
I plot the raw counts for the significantly differentially expressed TE with mean expression > 20 in order to check the variation between replicates:
4 Better look at full length elements
I consider the following as full-length RNA transposons:
- LINE1 >5kb
- IAP >6kb
- MMERVK10C >4.5kb
In rule 'count_on_TE' of the Snakefile, I also run the featureCounts command only for the annotated TE fulfilling the above requirements.
5 Correlation between expression and methylation for full length TE
Before running the Differential Expression analysis, the data are pre-filtered to remove all repetitive elements with < 10 reads among all samples.
For the following methylation-expression plot, only elements with a baseMean >20 are shown, because the very lowly expressed ones would look deregulated just because of noise affecting their small expression values.
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.7 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] knitr_1.30 ggpubr_0.4.0
## [3] ggrepel_0.8.2 gridExtra_2.3
## [5] DESeq2_1.24.0 SummarizedExperiment_1.14.1
## [7] DelayedArray_0.10.0 BiocParallel_1.18.1
## [9] matrixStats_0.57.0 Biobase_2.44.0
## [11] GenomicRanges_1.36.1 GenomeInfoDb_1.20.0
## [13] IRanges_2.18.3 S4Vectors_0.22.1
## [15] BiocGenerics_0.30.0 pheatmap_1.0.12
## [17] data.table_1.13.0 ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-1 ggsignif_0.6.0 ellipsis_0.3.1
## [4] rio_0.5.16 htmlTable_2.1.0 XVector_0.24.0
## [7] base64enc_0.1-3 rstudioapi_0.11 farver_2.0.3
## [10] bit64_4.0.5 AnnotationDbi_1.46.1 codetools_0.2-18
## [13] splines_3.6.3 geneplotter_1.62.0 Formula_1.2-3
## [16] broom_0.7.1 annotate_1.62.0 cluster_2.1.0
## [19] ashr_2.2-47 png_0.1-7 compiler_3.6.3
## [22] backports_1.1.10 Matrix_1.2-18 htmltools_0.5.0
## [25] tools_3.6.3 gtable_0.3.0 glue_1.4.2
## [28] GenomeInfoDbData_1.2.1 dplyr_1.0.2 Rcpp_1.0.5
## [31] carData_3.0-4 cellranger_1.1.0 vctrs_0.3.4
## [34] xfun_0.18 stringr_1.4.0 openxlsx_4.2.2
## [37] irlba_2.3.3 lifecycle_0.2.0 rstatix_0.6.0
## [40] XML_3.99-0.3 zlibbioc_1.30.0 scales_1.1.1
## [43] hms_0.5.3 RColorBrewer_1.1-2 yaml_2.2.1
## [46] curl_4.3 memoise_1.1.0 rpart_4.1-15
## [49] latticeExtra_0.6-29 stringi_1.5.3 RSQLite_2.2.1
## [52] SQUAREM_2020.4 highr_0.8 genefilter_1.66.0
## [55] checkmate_2.0.0 zip_2.1.1 truncnorm_1.0-8
## [58] rlang_0.4.8 pkgconfig_2.0.3 bitops_1.0-6
## [61] evaluate_0.14 lattice_0.20-41 invgamma_1.1
## [64] purrr_0.3.4 labeling_0.3 htmlwidgets_1.5.2
## [67] cowplot_1.1.0 bit_4.0.4 tidyselect_1.1.0
## [70] magrittr_1.5 R6_2.4.1 generics_0.0.2
## [73] Hmisc_4.4-1 DBI_1.1.0 pillar_1.4.6
## [76] haven_2.3.1 foreign_0.8-76 withr_2.3.0
## [79] prettydoc_0.4.0 survival_3.2-7 abind_1.4-5
## [82] RCurl_1.98-1.2 mixsqp_0.3-43 nnet_7.3-14
## [85] tibble_3.0.3 crayon_1.3.4 car_3.0-10
## [88] rmarkdown_2.4 jpeg_0.1-8.1 locfit_1.5-9.4
## [91] grid_3.6.3 readxl_1.3.1 blob_1.2.1
## [94] forcats_0.5.0 digest_0.6.25 xtable_1.8-4
## [97] tidyr_1.1.2 munsell_0.5.0