This is a repeat and gene annotation pipeline written in snakemake that draws inspiration from the genofish annotation pipeline, albeit with new capabilities, major modifications and without the use of MAKER. We originally designed it for the annotation of a Phoxinus sp. genome assembly. However, its versatility allows it to be applied to various vertebrate genomes and the output can be as sophisticated as your input databases.
The pipeline was designed with ease-of-use in mind and to allow the user to personalize the run as much as possible without the need for editing the code directly. All functionalities and customizations are manageable from the config/config.yaml
file.
Direct modifications to the makefiles should be reserved for instances of significant program installation discrepancies. It has been tested on an SGE cluster, and users on SGE or different systems are encouraged to provide their feedback.
For repeat annotation, the pipeline employs RepeatModeler, RepeatMasker, Tandem Repeat Finder, and Dustmaker. Gene annotation is performed by BRAKER3.
All inputs need to be specified in the config/config.yaml
file.
"{sample_ID_per_read_pair}1.{ext}"
and "{sample_ID_per_read_pair}2.{ext}"
.{filename}.fasta
file../utils
library and a gm_key for Genemark placed inside the user's home directory.If you don't have protein or RNA sequences for your sample or species, you can:
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
or download an OrthoDB dataset provided by BRAKER3 from here
if you don't have RNA-seq data you can look for them in: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
and unpack them with:
fastq-dump --split-files path/to/YOUR_SRA
In case you don't have your own sequencing data, or your data is not sufficient, and you need to use either RNA or protein evidence input for BRAKER3, it is advized to curate your own datasets and avoid redundancy in the sequences as much as possible.
This pipeline allows you to run BRAKER3 with any datatype you have. For more, see the following Functionalities section
The following packages need to be set up separately. They can be installed as per the instructions on their respective GitHub pages or in a manner suitable for your system. If your setup differs significantly, you may need to manually adjust the calls to these programs within their respective makefiles found under 'workflow/rules/':
In this pipeline, we utilize modules, which can be modified in the config/config.yaml
file.
RepeatModeler, RepeatMasker and BRAKER3 are used as containers which we find worked best on our system. You can specify them in the "containers" section of the config/config.yaml
file.
For BUSCO, it's recommended to use it offline
. Download your relevant lineage from here, unpack and place in a folder:
curl -O <link>
tar -xf <file> # becomes <folder>
mkdir busco_downloads
mkdir busco_downloads/lineages
mv <folder> busco_downloads/lineages
Specify the path in the config/config.yaml
file.
In order to run the pipeline, you need a basic conda environment with python>=3.9 and mamba on your system. You can create that with:
conda create -y -n genome_annotation "python>=3.9" "mamba>=0.22.1" -c conda-forge -c bioconda -c defaults
conda activate genome_annotation
You can then install the environments used by the different rules of the pipeline in a terminal with an internet connection:
#conda activate genome_annotation
snakemake --snakefile workflow/Snakefile.smk --cores 5 -p -r -w 5 --verbose --use-conda --conda-create-envs-only
This can take several minutes and needs to be done every time the environment files are changed.
Note: Before running the pipeline you need to have singularity activated on the node that will run the snakemake command.
This pipeline allows the user to customize the run based on the input data.
BRAKER3 mode
to "ES"
in the config/config.yaml
"ET"
. See Input{filename}.hints
or {filename}.hints.gff
- like the data provided from BRAKER3, from a previous BRAKER3 run or from making the data yourself with a program like exonerate, HISAT2 and/or ProtHint, you need to set the mode to "hints"
"EP"
."ETP"
. The RNA will be mapped to your genome with STAR and the output of that will be used for BRAKER3.The user can additionally edit the options passed to BRAKER from the configuration file.
To test if the pipeline can work in your system, you can download the example data provided by BRAKER3 from here
Once you do, you can add these files as input in the config/config.yaml
file like so:
genome:
file: "path/to/genome.fa"
haplotype: "hap1"
rnaseq:
folder: "folder/to/rnaseq/"
type: "Stranded"
prot_db: "path/to/my_proteins.fa"
snakemake -s workflow/Snakefile.smk --dry-run --cores 5 -p -r -w 5 --verbose
The pipeline accepts one assembly fasta file at a time for better control of each run, resource management and to avoid overwriting temporary files.
The configuration "HAP"
variable can be used as a proxy for the assembly prefix and the haplotype used for each run.
You can then run the pipeline with:
snakemake \
--snakefile workflow/Snakefile.smk \
--keep-going \
--latency-wait 300 \
-j ${THREADS} \
--use-singularity \
--singularity-args "--home $PWD" \
--singularity-args "--bind $PWD/temp:/tmp" \
--default-resources "tmpdir='/path/to/tmp'" \
--verbose \
--use-conda \
--use-envmodules \
--printshellcmds \
--reason \
--nolock \
--rerun-triggers mtime \
--rerun-incomplete \
--stats "./stats.json"
For repeat masking we used the fasta_split_1.pl script provided by the Sigenae platform and used in the genofish project (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7050324/) to split the genome into smaller parts, and Fan Wei's repeat_to_gff.pl script to convert the trf and dustmasker output to gff3.
Additionally, we used custom perl code provided by the genofish pipeline ((https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7050324/) together with the rmOutToGFF3.pl script from the RepeatMasker suite (http://www.girinst.org/) to transform the RepeatMasker gtf to gff3 format.
You can create a represenation of the pipeline with:
snakemake -s workflow/Snakefile.smk --dag --forceall | dot -Tpng > graph_of_jobs.png
snakemake -s workflow/Snakefile.smk --dag --until final_masking | dot -Tpng > workflow/graph_of_jobs_repeat_masking.png
snakemake -s workflow/Snakefile.smk --filegraph --forceall | dot -Tpng > filegraph_all.png
snakemake -s workflow/Snakefile.smk --rulegraph --forceall | dot -Tpng > workflow/rulegraph_all.png
snakemake -s workflow/Snakefile.smk --rulegraph --until final_masking | dot -Tpng > workflow/rulegraph_repeat_masking.png
Bruna, T., Hoff, K.J., Lomsadze, A., Stanke, M., & Borodovsky, M. (2021). BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database. NAR Genomics and Bioinformatics 3(1):lqaa108, doi: 10.1093/nargab/lqaa108.
Hoff, K.J., Lomsadze, A., Borodovsky, M. and Stanke, M. (2019). Whole-Genome Annotation with BRAKER. Methods Mol Biol. 1962:65-95, doi: 10.1007/978-1-4939-9173-0_5.
Hoff, K.J., Lange, S., Lomsadze, A., Borodovsky, M. and Stanke, M. (2016). BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, 32(5):767-769.
Stanke, M., Diekhans, M., Baertsch, R. and Haussler, D. (2008). Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics, doi: 10.1093/bioinformatics/btn013.
Stanke. M., Schöffmann, O., Morgenstern, B. and Waack, S. (2006). Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62.