1 Reading the data

The input files are:

Samples table
SRR sample condition group_or_time_point biol_rep tech_rep run library_layout read_length
none naives_345 WT 1 1 1 1 PAIRED - 101
none naives_266 WT 1 2 1 1 PAIRED - 101
none naives_344 WT 1 3 1 1 PAIRED - 101
none naives_325 WT 1 4 1 1 PAIRED - 101
none naives_440 KO 1 1 1 1 PAIRED - 101
none naives_278 KO 1 2 1 1 PAIRED - 101
none naives_287 KO 1 3 1 1 PAIRED - 101
none naives_340 KO 1 4 1 1 PAIRED - 101
none J7_300 WT 2 1 1 1 PAIRED - 101
none J7_305 WT 2 2 1 1 PAIRED - 101
none J7_189 WT 2 3 1 1 PAIRED - 101
none J7_290 WT 2 4 1 1 PAIRED - 101
none J7_400 KO 2 1 1 1 PAIRED - 101
none J7_187 KO 2 2 1 1 PAIRED - 101
none J7_291 KO 2 3 1 1 PAIRED - 101
none J7_399 KO 2 4 1 1 PAIRED - 101
none J14_193 WT 3 1 1 1 PAIRED - 101
none J14_307 WT 3 2 1 1 PAIRED - 101
none J14_350 WT 3 3 1 1 PAIRED - 101
none J14_295 WT 3 4 1 1 PAIRED - 101
none J14_351 KO 3 1 1 1 PAIRED - 101
none J14_186 KO 3 2 1 1 PAIRED - 101
none J14_192 KO 3 3 1 1 PAIRED - 101
none J14_394 KO 3 4 1 1 PAIRED - 101

2 Analysis at family level

2.1 FPKM for each family of Repetitive Elements

For each family of Repetitive Elements (in case of elements with no repFamily name or repFamilies belonging to more than one repClass I use repClass) I compute FPKM values, as follows: for each sample:

  • I compute the sum of counts for all elements belonging to that repFamily
  • I divide this sum by the total number of reads sequenced for that sample and multiply by 10⁶
  • I divide this number by the total sum of lengths (in Kb) of the elements belonging to that repFamily --> FPKM
  • I subtract from each FPKM the total FPKM of all transposons belonging to the DNA repClass

2.2 Heatmaps

The heatmaps are scaled by rows.

It is evident that the correction for DNA trasposons is necessary to avoid that some replicates show random higher expression of some elements. The most affected samples by DNA contamination (and therefore by the effect of the normalization) are samples naives_344, naives_440 and J14_394. I will exclude these samples from all the analysis below.

3 DE-Seq analysis of RNA transposons

I include the FPKM of DNA transposons as confounding factor in DESeq2 formula.

Before running the Differential Expression analysis, the data are pre-filtered to remove all repetitive elements with < 10 reads among all samples.

3.1 MA-plots

  • The threshold used for a dot to be coloured in the MA-plots is: p-value adjusted < 0.05 and mean expression > 20.
  • Transposable elements whose log2FoldChange > 2 (or < -2) are labeled.

I plot the raw counts for the significantly differentially expressed TE with mean expression > 20 in order to check the variation between replicates:

4 Better look at full length elements

I consider the following as full-length RNA transposons:

In rule 'count_on_TE' of the Snakefile, I also run the featureCounts command only for the annotated TE fulfilling the above requirements.

5 Correlation between expression and methylation for full length TE

Before running the Differential Expression analysis, the data are pre-filtered to remove all repetitive elements with < 10 reads among all samples.

For the following methylation-expression plot, only elements with a baseMean >20 are shown, because the very lowly expressed ones would look deregulated just because of noise affecting their small expression values.

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.7 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] knitr_1.30                  ggpubr_0.4.0               
##  [3] ggrepel_0.8.2               gridExtra_2.3              
##  [5] DESeq2_1.24.0               SummarizedExperiment_1.14.1
##  [7] DelayedArray_0.10.0         BiocParallel_1.18.1        
##  [9] matrixStats_0.57.0          Biobase_2.44.0             
## [11] GenomicRanges_1.36.1        GenomeInfoDb_1.20.0        
## [13] IRanges_2.18.3              S4Vectors_0.22.1           
## [15] BiocGenerics_0.30.0         pheatmap_1.0.12            
## [17] data.table_1.13.0           ggplot2_3.3.2              
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1       ggsignif_0.6.0         ellipsis_0.3.1        
##  [4] rio_0.5.16             htmlTable_2.1.0        XVector_0.24.0        
##  [7] base64enc_0.1-3        rstudioapi_0.11        farver_2.0.3          
## [10] bit64_4.0.5            AnnotationDbi_1.46.1   codetools_0.2-18      
## [13] splines_3.6.3          geneplotter_1.62.0     Formula_1.2-3         
## [16] broom_0.7.1            annotate_1.62.0        cluster_2.1.0         
## [19] ashr_2.2-47            png_0.1-7              compiler_3.6.3        
## [22] backports_1.1.10       Matrix_1.2-18          htmltools_0.5.0       
## [25] tools_3.6.3            gtable_0.3.0           glue_1.4.2            
## [28] GenomeInfoDbData_1.2.1 dplyr_1.0.2            Rcpp_1.0.5            
## [31] carData_3.0-4          cellranger_1.1.0       vctrs_0.3.4           
## [34] xfun_0.18              stringr_1.4.0          openxlsx_4.2.2        
## [37] irlba_2.3.3            lifecycle_0.2.0        rstatix_0.6.0         
## [40] XML_3.99-0.3           zlibbioc_1.30.0        scales_1.1.1          
## [43] hms_0.5.3              RColorBrewer_1.1-2     yaml_2.2.1            
## [46] curl_4.3               memoise_1.1.0          rpart_4.1-15          
## [49] latticeExtra_0.6-29    stringi_1.5.3          RSQLite_2.2.1         
## [52] SQUAREM_2020.4         highr_0.8              genefilter_1.66.0     
## [55] checkmate_2.0.0        zip_2.1.1              truncnorm_1.0-8       
## [58] rlang_0.4.8            pkgconfig_2.0.3        bitops_1.0-6          
## [61] evaluate_0.14          lattice_0.20-41        invgamma_1.1          
## [64] purrr_0.3.4            labeling_0.3           htmlwidgets_1.5.2     
## [67] cowplot_1.1.0          bit_4.0.4              tidyselect_1.1.0      
## [70] magrittr_1.5           R6_2.4.1               generics_0.0.2        
## [73] Hmisc_4.4-1            DBI_1.1.0              pillar_1.4.6          
## [76] haven_2.3.1            foreign_0.8-76         withr_2.3.0           
## [79] prettydoc_0.4.0        survival_3.2-7         abind_1.4-5           
## [82] RCurl_1.98-1.2         mixsqp_0.3-43          nnet_7.3-14           
## [85] tibble_3.0.3           crayon_1.3.4           car_3.0-10            
## [88] rmarkdown_2.4          jpeg_0.1-8.1           locfit_1.5-9.4        
## [91] grid_3.6.3             readxl_1.3.1           blob_1.2.1            
## [94] forcats_0.5.0          digest_0.6.25          xtable_1.8-4          
## [97] tidyr_1.1.2            munsell_0.5.0