Identification of human mitochondrial RNA cleavage sites and candidate RNA processing factors

Hodgkinson, Alan; Ali, Aminah; Carbajosa, Guillermo

doi:10.5281/zenodo.6799020

Published June 4, 2021 | Version v1

Journal article Open

Identification of human mitochondrial RNA cleavage sites and candidate RNA processing factors

1. King's College London

# Mitochondrial RNA Cleavage identification workflow

The scripts in this repository together with other freely available bionformatics tools listed in the workflow below will let you calculate mitochondrial RNA cleavage ratios from any set of RNA-Seq fastq files. It is possible to adapt the pipeline to run with single end reads but the following workflow applies to paired end read data only.


These are the steps needed to go from fastq files to cleavage ratios:

1 - Trim sequence adapters from the fastq files with Trimgalore. We recommend using stringency 3. We recommend keeping quality as 0 because we are interested in the detection the position where the RNA fragments end and not the base calling. The command would be something like the following:

    ./trim_galore --quality 0 --stringency 3 --output_dir 3 --paired fq_file1 fq_file2
 
2- Trim PolyA reads using prinseq. We recommend trimming 5 bases. This step uses as input the fastq files obtained on the previous step.This would be an example command: 

    perl prinseq-lite.pl -fastq fq_file1_from_trimgalore -fastq2 fq_file2_from_trimgalore -out_good fq_out_filename -min_len 20 -trim_tail_left 5 -trim_tail_right 5

3- Map the trimmed reads from the previous steps with STAR using the EndToEnd option to avoid soft-clipping and therefore detect more accurately the positions of the original RNA fragment ends. Please note tha STAR requires a large amount of RAM memory (up to 32G) so this step would have to be run in a suitably powered computer. The --outSAMunmapped Within option can be specificed to keep unmapped reads for QC purposes. The --limitBAMsortRAM 100000000000 option is also recommended to avoid computer crashes due to RAM overload. We recommend reading the STAR manual if in doubt about these or other options (see https://github.com/alexdobin/STAR) An example command would be:

    ./STAR --runThreadN 18 --alignEndsType EndToEnd --outSAMtype BAM SortedByCoordinate --genomeDir Path_2_STAR_genome_dir --readFilesIn $readFilesInPath/fq_file1_from_trimgalore_and_prinseq fq_file2_from_trimgalore_and_prinseq --outFileNamePrefix outFileNamePrefix 

4 - Keep only mitochondrial reads (specify MT or chrM depending on the genome reference used) reads in bam file using samtools (http://samtools.sourceforge.net/):

     samtools index outFileNamePrefix.bam 
     samtools view -hb  outFileNamePrefix.bam chrM > outFileNamePrefix.MT.bam

5 - Generate cleavage ratios running get_linear_cleavage_ratio.pl. The script requires 2 arguments:
    - The name of the bam file name obtained using the previous steps
    - A prefix to be used by the script to generate the output

    perl get_linear_cleavage_ratio.pl outFileNamePrefix.MT.bam my_out_cleavage >my_out_cleavage.log 2>my_out_cleavage.err



Once the ratios have been generated it is possible to determine the presence of significantly high ratios across samples choosing a cleavage ratio threshold. It is also possible to generate a matrix with the chosen thershold to use as input for quantitive trait loci analysis (for example with PLINK). These are the steps required: 


6 - Filter peaks running get_himalaya_ratios.pl. The script uses a cleavage ratio threshold (we recommend 0.1) and keeps the highest ratio above the threshold. When there is a cluster of peaks (Himalaya) within a bp range (we recommend 5bp) the script selects the peak with the highest ratio (everest) to represent the cluster. It can be run using a command like: 

    perl get_himalaya_ratios.checkCov.pl --ratiosFile=rat_filename --everestOutfile=eve_out_filename --himalayaOutfile=him_out_filename >him_out_filename.log 2>him_out_filename.err

7 - Using the filtered peaks from the previous step, you can filter across your samples for the presence of cleavage ratios across a proportion of your samples running generate_all_himalayas_peaks_presence.pl

     - Generate a list of files to process. For example:
        
        ls *eve.txt > my_list_of_everest_files.txt

     - Calculate in which proportion of the files each peak is present (e.g. if a peak is present in 2 out 4 files it will have a proportion of 0.5)

        perl generate_all_himalayas_peaks_presence.pl  my_list_of_everest_files.txt

8 - Finally you can generate a matrix of cleavage ratios per sample to use as input for, for example, quantitative loci analysis with PLINK (https://zzz.bwh.harvard.edu/plink/). Youy need as input:
    - Filename of the peaks presence positions table (required)
    - Filename with a list of ratio files you want to use as input (required)
    - Threshold for the proportion you want to use as threshold (optional, default is 0.5)
    - Filename for the ouput file (optional, default is peaks_presence_matrix.05.txt)
    - Read coverage (optional, default is 20)

    perl generate_all_himalayas_peaks_matrix.pl peaks_presence_full_table.txt ratio_filenames_list.txt 0.5 peaks_presence_matrix.05.txt 20


A worflow that illustrates the whole process is included in this Zenodo upload as a png. It is also available in GitHub at:

https://github.com/gcarbajosa/Mitochondrial_RNA_Cleavage/blob/main/CleavageRatiosWorkflow.png

Files

CleavageRatiosWorkflow.png

Files (247.5 kB)

Name	Size	Download all
CleavageRatiosWorkflow.png md5:5df9f6e7f34d4a0b1cf046e691614408	228.6 kB	Preview Download
generate_all_himalayas_peaks_matrix.pl md5:bcb7ffb00658eaa566a13b94633ed6a9	2.3 kB	Download
generate_all_himalayas_peaks_presence.pl md5:4a7b80792216f942bef1bc6c512e4b35	1.2 kB	Download
get_himalaya_ratios.pl md5:852be0b9e29d3266cd17576f3d7a7192	4.1 kB	Download
get_linear_cleavage_ratio_PE.pl md5:c0a123bb590a4219fe0b0037fb683bdd	6.2 kB	Download
README.md md5:03db400718ddb48a70ca63077db0e154	5.1 kB	Preview Download

	All versions	This version
Views	158	158
Downloads	226	226
Data volume	7.4 MB	7.4 MB

Identification of human mitochondrial RNA cleavage sites and candidate RNA processing factors

Authors/Creators

Description

Files

CleavageRatiosWorkflow.png

Files (247.5 kB)