Genome annotation notebook
Introduction
The goal of this tutorial is to annotate the proteins from two genomes. Therefore, we will:
- download two genomes from NCBI
- annotate them using several databases, such as the COGs, arCOGs, KOs, Pfams, …
- combine the results into one comprehensive table.
If you already have proteins for your genome of interest, you do not need to do the first steps. You can either start with the mapping-file part (which runs Prokka for gene calling and includes some bash steps to add summary statistics, such as the bin ID, contig name and contig length, to the protein files), or, if you already have protein files, you can start at the step ‘Do the annotations’. The only thing you might want to generate is a file that lists all the proteins you are working with.
General comments
- A lot of programs in this workflow are run with for loops or GNU parallel to speed things up (see the short sketch after this list). While this might not make much sense for only two genomes, it becomes very useful once you work with thousands of files.
- For most annotation steps we can use several CPUs. For this tutorial we only use a few CPUs, but this should be adjusted according to the resources you have available. More on how to check resources can be found in the Unix tutorial.
- This script uses a lot of AWK for parsing. If you want some more insight into how to use AWK, we also have an AWK tutorial available.
- Once you have the annotation table, we provide an R tutorial here that gives an example of how to summarize the data.
- The individual database searches might take a while (especially the InterProScan and the DIAMOND search against the NCBI database). Therefore, we will not run every database search during this tutorial. However, we provide all intermediate data in a repository so you can have a look at what kind of data is generated.
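As a minimal sketch of these two patterns (the command and files are placeholders only; wc -l simply stands in for any per-file task):
#the same per-file task written as a for loop and with GNU parallel
for f in *.fna; do wc -l "$f"; done
#GNU parallel runs up to 4 files at the same time (adjust -j to your resources)
parallel -j 4 'wc -l {}' ::: *.fna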
General info on this tutorial
- grey box: the code that we want to execute
- red box: if text is highlighted with a red box, this marks additional code that is useful, but that we WILL NOT use during this tutorial
- Hidden code: in some cases the code is hidden and you only see a little box on the right-hand side. Hidden code means that you should have enough knowledge from the previous sessions to do this step, or that it is a test of your skills. If you run into issues, you can reveal the code by clicking on the Code button.
- hyperlink: if text is highlighted like this in a text section, you can click on it and get forwarded to a website with more information, i.e. a publication or an online tutorial.
- Exercises/Questions/Homework: these appear if there are some questions for you to check whether you understood everything; they are not required to finish the tutorial. Some of them are more time-consuming and might be best run on your own after you have finished the complete workflow.
- Sometimes we also just ask questions to make you think about the output of certain programs. The answer is then hidden and you have to hover over the “spoiler” and/or click a Show/hide button to see it.
General
Below you find a list of the programs used. If you want to run this pipeline, please install the listed programs. Custom scripts and databases can be found in a Zenodo repository.
General notice for people interested in eukaryotic genomes:
This tutorial is designed specifically with prokaryotic genomes in mind but can be used for eukaryotic proteins as well. This is especially easy if you already have proteins for your genomes of interest. If you do not have such files, beware that the tool we use for protein calling, Prokka, is specific to prokaryotic genomes. Tools for identifying eukaryotic proteins are, among others, Augustus for single genomes or MetaEuk for metagenomes.
Something to keep in mind with the databases used is that they were often generated using only archaeal and/or bacterial genomes. Only a few of the databases used here, such as the KO HMMs, also include eukaryotic genomes. A good alternative to add are the KOGs, which are specifically designed for eukaryotes. The most recent profiles provided by EggNOG can be found here.
Version programs
The versions might differ from what you have available on your system, which should not be problematic, but for consistency we list the version numbers that were used to generate this workflow as well as links with install instructions.
- Python 2.7.5
- perl v5.16.3
- prokka 1.14-dev ; install instructions can be found here
- HMMER 3.1b2; install instructions can be found here
- blastp v2.7.1+; install instructions can be found here
- diamond 0.9.22; install instructions can be found here
- interproscan: InterProScan-5.29-68.0; install instructions can be found here
- GNU parallel 20181222; install instructions can be found here
Version databases
Databases (and mapping files) used in the current version can be downloaded from here.
- KOhmms: downloaded 90419 from here
- PGAP (former TIGRs): downloaded Nov 2021 from here
- Pfams: RELEASE 34.0, downloaded Nov 2021 from here
- CAZymes: dbCAN-HMMdb-V9 downloaded on 21 April 2020 from here
- Merops: downloaded from Merops server in November 2018 from here
- HydDB: downloaded from the HydDB webserver in July 2020 from here
- COGs: downloaded from NCBI in October 2020 from here
- TransporterDB: downloaded from TCDB in April 2021 from here
- ncbi_nr (maintained and downloaded by Alejandro Abdala Asbun). Last update was done on August 2021
- ArCOGs: downloaded in 2019 and available here (update date: 2018)
External scripts needed
Custom scripts used in the current version can be downloaded from [here](https://zenodo.org/deposit/6489670).
perl
- length+GC.pl
python
- Parse_prokka_for_MAGs_from_gbk-file.py
- Split_Multifasta.py
- parse_IPRdomains_vs2_GO_2_ts_sigP.py
- parse_diamond_blast_results_id_taxid.py
Setup working environment
Define variables
Notice: If you have done the assembly and binning tutorials, you will see that we do things a bit differently here. Since for the annotations we work with a lot of different databases, there are two ways to deal with many files in different locations:
- we can add the path for each database into the code itself
- we can define variables and store the file paths inside these variables
Especially if you work with collaborators it often is easier to use the second option, as you just have to adjust the paths in one spot.
Instructions:
- Change the first variable, wdir, to the path where you want to work from, and change the paths for the databases according to where you downloaded them.
- The last two lines of code are needed for the software Prokka and are specific to the system the script was written on. Adjust according to how Prokka is set up on your system.
- For this trimmed tutorial we only work with the COG, arCOG, KO and Transporter databases, so those are the key paths to exchange!
#set the path to our working directory
wdir="/export/lv1/user/ndombrowski/Mai_testing"
#nr of cpus we generally want to use
cpus=4
#location of some custom python and perl scripts
python_scripts="/export/lv1/user/spang_team/Scripts/Others/"
perl_scripts="/export/lv1/user/spang_team/Scripts/Others/"
#locations of the different databases
cog_db="/export/data01/databases/cogs/hmms/NCBI_COGs_Oct2020.hmm"
arcog_db="/export/data01/databases/arCOG/arCOGs2019/All_Arcogs_2018.hmm"
ko_db="/export/data01/databases/ko/combined_hmms/All_KOs.hmm"
pfam_db="/export/data01/databases/pfam/Pfam-A.hmm"
tigr_db="/export/data01/databases/PGAP_ProtFam/hmm_PGAP.lib"
cazy_db="/export/data01/databases/cazy_spt/dbCAN-HMMdb-V7.txt"
transporter_db="/export/data01/databases/tcdb/tcdb"
hyd_db="/export/data01/databases/HydDB/HydDB_uniq"
ncbi_nr_db="/export/data01/databases/ncbi_nr/diamond/nr.dmnd"
#locations of the mapping files for the different databases (mapping files link the database ID to a more useful description)
cog_mapping="/export/data01/databases/cogs/hmms/cog_mapping.txt"
arcog_mapping="/export/lv1/user/spang_team/Databases/arCOGs2019/ar14.arCOGdef18.final.tab"
ko_mapping="/export/data01/databases/ko/ko_list_for_mapping_clean"
pfam_mapping="/export/data01/databases/pfam/Pfam-A.clans.cleaned.tsv"
tigr_mapping="/export/lv1/user/spang_team/Databases/PGAP/hmm_PGAP_edited.tsv"
cazy_mapping="/export/data01/databases/cazy_spt/CAZY_mapping_2.txt"
transporter_mapping="/export/lv1/user/spang_team/Databases/TransporterDB/tcdb_definition_2021.tsv"
hyd_mapping="/export/data01/databases/HydDB/HydDB_mapping"
taxon_db="/export/data01/databases/taxmapfiles/ncbi_nr/prot.accession2taxid.gz"
#set environment to work for prokka
export PERL5LIB=/export/lv1/public_perl/perl5/lib/perl5
#add libraries so that tbl2asn works
export PATH=$PATH:/opt/biolinux/prokka/binaries/linux/
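Optionally, before continuing you can quickly check that the database paths you just entered exist (a small sketch that only covers the files used in this trimmed tutorial; the .phr extension assumes the transporter database was formatted with makeblastdb for proteins):
#optional sanity check: do the database and mapping files exist?
for f in "$cog_db" "$arcog_db" "$ko_db" "${transporter_db}.phr" "$cog_mapping" "$arcog_mapping" "$ko_mapping" "$transporter_mapping"; do
  [ -e "$f" ] && echo "OK: $f" || echo "MISSING: $f"
done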
Set our working directory
Let’s first go to our working directory and make a new folder for the Annotation tutorial.
#check that we set the variable to our working directory correctly
echo $wdir
#change the directory to your home directory
cd $wdir
#make a folder for the Annotations and move into this directory
mkdir Annotations
cd Annotations
Prepare genomes of interest
In this workflow we will work with 2 genomes that we download from NCBI, but the steps can easily be adjusted for your own genomes. You could also work with the bins we assembled and binned in the previous workflows; if you feel comfortable with the command line, feel free to use those instead.
First, we will download two archaeal genomes from NCBI and rename them a bit to avoid issues with the workflow. The bins from the previous workflows (inside ../Binning/4_FinalBins/renamed) already have a good header and should be ready to use in case you want to repeat the workflow with those genomes.
After we have downloaded the 2 genomes from NCBI, the file name and file header should look something like this:
We can see that both the file name and file header are long and have a lot of symbols UNIX might have trouble with (brackets, spaces, dots…). Generally, things to watch out for when naming files are:
- Have short but descriptive file names and file headers. I.e. if we look at the header of the NCBI genomes you will see that the header is rather long, so it is useful to condense it.
- Avoid extra symbols (space, comma, semicolon, quotes, brackets) as these can easily result in unwanted behavior. I.e. a very simple naming scheme could use only two special symbols:
  - underscores, `_`, for everything where we normally would use a space
  - the minus symbol, `-`, for key information that we might want to separate later (i.e. genome name and protein ID)
Another useful thing is to add information to the headers that allows us to concatenate multiple genomes and afterwards still know from which genome a sequence came. Therefore, we recommend to:
- Add the genome name into the fasta header and use a unique delimiter to separate it from the contig ID. That way you can concatenate your genomes and afterwards still know from which genome a contig came.
- Have a consistent way to separate the genome and protein ID across different analyses. In phylogenies it is very common to concatenate several genes/proteins from individual genomes, and for this it is useful to remove the protein ID from the fasta header.
#make folders in which we store the genomes
mkdir fna
mkdir fna/renamed
#download genomes from NCBI using wget
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/085/GCA_000008085.1_ASM808v1/GCA_000008085.1_ASM808v1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/945/GCA_000017945.1_ASM1794v1/GCA_000017945.1_ASM1794v1_genomic.fna.gz
#unzip files and check the file names
gzip -d *gz
ls -l *.fna
#shorten the file name by cutting everything away after the dot
for f in *.fna; do mv $f "${f/.*}".fna; done
#check the file name again
ls -l *.fna
#check how the header looks --> we have a long name and a lot of symbols we do not want, lets remove these
head GCA_000008085.fna
#shorten the fasta headers by removing everything after the space
for f in *.fna; do cut -d " " -f1 $f > fna/${f}; done
head fna/GCA_000008085.fna
#add the genome name into the header (this is useful if we want to concatenate a lot of genomes into one file)
cd fna
for i in *fna; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fna/,x)}1' $i > renamed/$i; done
head renamed/GCA_000008085.fna
#remove intermediate files
rm *fna
cd ..
rm *fna
The cleaned up files and file names should look something like this afterwards:
The steps above work fine but what if we have hundreds of genomes from NCBI?
For the tutorial we will NOT run the code below but continue at the step:
Make a file list for all genomes of interest. However, if you deal a lot with genomes from NCBI and want a file that links an NCBI tax ID to a taxonomic string, this might still be useful to know. I.e. if you work with a genome of interest, for example E. coli, you might want to include several closely related genomes as references in your analysis, and NCBI is an excellent resource providing a multitude of genomes as well as metadata.
Notice:
- The file that links the taxonomy ID with the taxa string (lineages-2020-12-10.csv) was generated with ncbitax2lin (instructions added below)
- this part of the workflow was set up to run on servers at the NIOZ, and the paths need to be adjusted to work on your system
The first part of this workflow is to get a list of all available archaeal (or bacterial) genomes from ncbi and add a taxa string into the file we downloaded.
#prepare some folders
mkdir ncbi_genomes
cd ncbi_genomes
#get a list of all genomes available in GenBank
wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
This file includes a list of all genomes available at NCBI and relevant metadata such as the project ID, taxonomy ID, species name, download link for individual genome files etc.
Next, we will clean the file a bit and since the species name might not be informative enough (i.e. we might want to know to what phylum a species belongs), we want to add the full taxonomy string in.
#clean up the file by removing the first row and any `#` and move the tax id to the front
sed 1d assembly_summary_genbank.txt | sed 's/# //g' | awk -F"\t" -v OFS="\t" '{print $6, $0} ' | sed 's/ /_/g' > assembly_summary_edit.txt
#replace empty rows with a "none" and remove the header
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' assembly_summary_edit.txt | sed '1d' > temp1
#add in the tax string (to be able to better select genomes)
#the code for how to generate the csv needed here can be found below
#merge assembly info with tax string
LC_ALL=C join -a1 -j1 -e'-' -o 0,1.2,1.3,1.4,1.5,1.6,1.8,1.9,1.10,1.11,1.13,1.16,1.17,1.18,1.19,1.21,2.2 <(LC_ALL=C sort temp1) <(LC_ALL=C sort /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020/lineages-2020-12-10.csv) -t $'\t' > temp2
#add in headers
echo -e "taxid\taccession\tbioproject\tbiosample\twgs\trefseq_category\tspeciesID\torganism\tintraspecific_name\tisolate\tassembly\trel_date\tasm\tsubmitter\tgcf_name\tftp_path\ttax" | cat - temp2 > assembly_summary_archaea_Dec2020_tax.txt
cd ..
In this file you can simply look for the genomes you are interested in. You can subset for lineages of interest by selecting specific groups via their taxonomy, and you might run CheckM as well to select them based on their genome stats.
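For example, a minimal sketch of such a subset (the pattern 'Thermococc' is only an illustration; per the header added above, the ftp_path is column 16 and the taxonomy string column 17):
#subset the assembly summary by taxonomy and collect the download links
awk -F"\t" '$17 ~ /Thermococc/ {print $16}' ncbi_genomes/assembly_summary_archaea_Dec2020_tax.txt > get_genomes.txt
#check how many genomes we selected
wc -l get_genomes.txt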
Once you have selected the genomes you are interested in, you can copy the ftp_path links into a document called get_genomes.txt
via nano and then make some small changes to the file with sed. Afterwards you can download all the genomes from NCBI.
# extend path to be able to download the genomes
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA/)([0-9]{3}/)([0-9]{3}/)([0-9]{3}/)(GCA_.+)|\1\2\3\4\5\6/\6_genomic.fna.gz|' get_genomes.txt > get_genomes_v2.txt
# download genomes
for next in $(cat get_genomes_v2.txt); do wget "$next"; done
Once you have the genomes, you can clean the names as suggested above and continue with the steps below.
In case you want to generate the csv file that links the ncbi taxID to taxstring, which includes the phylum, class, order, etc information, this is the workflow (you will need to download ncbitax2lin):
#change to the directory in which we installed ncbitax2lin
cd /export/lv1/user/spang_team/Scripts/NCBI_taxdump/ncbitax2lin
#start bashrc that allows to run conda
source ~/.bashrc.conda2
#setup virtual environment (see more instructions here: https://github.com/zyxue/ncbitax2lin)
#virtualenv venv/
#re-do taxonomy mapping file
make
#unzip file
gzip -d lineages-2020-12-10.csv.gz
#close virtual environment
#source deactivate venv/
#make a new directory for the most current tax
mkdir /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020
#change dir
cd /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020
#cp the taxa file into our new dir and replace spaces with underscore
sed 's/ /_/g' ~/../spang_team/Scripts/NCBI_taxdump/ncbitax2lin/lineages-2020-12-10.csv > temp1
#only print the levels until species level
awk -F',' -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$7,$8}' temp1 > temp2
#add in “none” whenever a tax level is empty
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' temp2 > temp3
#merge columns 2-8, thus we have 1 column with the tax number and one column with the tax string
awk ' BEGIN { FS = OFS = "\t" } {print $1,$2","$3","$4","$5","$6","$7","$8}' temp3 > temp4
#remove header
sed '1d' temp4 > lineages-2020-12-10.csv
#clean up
rm temp*
Make a file list for all genomes of interest
Make a file that lists all the genomes we are working with. Such a file list is useful when working with for loops.
#make a folder for all the secondary files we need
mkdir FileLists
#list our genomes of interest
ls fna/* | sed 's/\.fna//g' | sed 's/fna\///g' > FileLists/GenomeList
#check content of file
head FileLists/GenomeList
The last command should print something like this:
Call proteins with prokka and clean files
Now that we have the genomes organized, let's identify the coding regions and call the proteins with Prokka. Another common tool you might encounter for this step is Prodigal; Prokka basically uses Prodigal under the hood but adds some extra annotations, which we also want to look at.
Notice:
- For Prokka we need some specific setup for the servers this workflow was written on. If you are on your own system, adjust accordingly (the export commands are part of the start of the script, where we define all the additional environment variables).
- Our genomes are archaeal; if you work with bacterial genomes, change the --kingdom Archaea part below. This setting makes no major difference for calling proteins, but the Prokka annotations might change slightly.
- Prokka is not designed to work with eukaryotic genomes; an alternative to look into for eukaryotes is, for example, Augustus.
- Prokka does very quick and dirty annotations. These are good for a first glimpse of what you have, but in our experience there can be several issues, so use them with caution.
Run prokka
Prokka: a tool for rapid prokaryotic genome annotation (Seeman et al., 2014).
#prepare new directories to store our data
mkdir faa
mkdir faa/Prokka
#run prokka
for sample in `cat FileLists/GenomeList`; do prokka fna/renamed/${sample}.fna --outdir faa/Prokka/${sample} --prefix $sample --kingdom Archaea --addgenes --force --increment 10 --compliant --centre UU --cpus 2 --norrna --notrna ; done
#control number of Genomes
ls faa/Prokka/*/*faa | wc -l
#check how the genome header looks like
head faa/Prokka/GCA_000008085/GCA_000008085.faa
We have 2 genomes and the header of the first file should look something like this:
We can see that prokka gives a unique protein ID and adds an annotation to the file header.
Additionally, Prokka generates several output files; among others these include the protein sequences (.faa), a GenBank file (.gbk) and the annotation in GFF3 format (.gff), several of which we will use below.
Prokka does not only call proteins but also identifies tRNAs, adds protein annotations, etc.
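To see all the files Prokka produced for one genome, you can simply list its output folder:
#list the different output files prokka generated for the first genome
ls -l faa/Prokka/GCA_000008085/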
Concatenate genomes
If we look at the fasta headers of the newly generated files, we see that the header is not ideal. I.e. we have something like this: `>CALJACFG_00010 ATP-dependent DNA helicase Hel308`.
This header is very long and uses a lot of extra symbols (a space, and a minus symbol, which we already use for other purposes). Therefore, we first want to clean things up a bit.
Additionally, since for the following workflow we want to concatenate the two genomes, i.e. merge the two files into one, it is useful to add the genome name into the fasta header before merging the files.
This is optional, but it is useful to use a special delimiter to later separate the genome name from the protein name. Therefore, we add the file name and separate it from the old sequence header with a `-`.
#go into the prokka folder
cd faa/Prokka/
#organize files
mkdir renamed
cp GCA*/*faa .
#add the filename into the fasta header
for i in *faa; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.faa/,x)}1' $i > renamed/$i; done
#cleanup temp files
rm *faa
#concatenate files
cat renamed/*faa > All_Genomes.faa
#clean the header/shorten it by removing everything after the space
cut -f1 -d " " All_Genomes.faa > All_Genomes_clean.faa
#go back to the wdir
cd ../..
#check the fasta header
head faa/Prokka/All_Genomes_clean.faa
#make a blast db out of concat genomes (we need this whenever we want to run blast)
makeblastdb -in faa/Prokka/All_Genomes_clean.faa -out faa/All_Genomes_clean -dbtype prot -parse_seqids
Now our file header should look something like this:
If you work with your own genomes
- Ideally have not only the protein ID but also the genome ID in the sequence header.
- Make sure the file header is concise, does not have ANY spaces, and ideally uses a `-` to separate the bin ID from the protein ID.
- If you have a concise header plus the bin ID in the header, first concatenate the protein sequences of your genomes into one file.
More specifically you want to:
- for the fna files:
  - generate a folder named `fna/renamed`
  - add the individual genomes in there
- for the faa files:
  - generate the file in the folder `faa/Prokka`, i.e. concatenate the proteins from all your genomes of interest and name the file `All_Genomes_clean.faa`
- continue with the steps below
Make a mapping file with all relevant genome info
Now that our files are ready we first want to make a table with the basic stats (i.e. contig length, contig gc, protein length, …).
Get contig stats
Prokka has the habit of renaming the fasta headers, i.e. the original contig names are replaced. Therefore, it is hard to tell from which contig the proteins originally came. The workflow below generates a mapping file that links the original contig names with the Prokka contig names.
The goal is to generate a file with this information:
- old contig name, binID, contig GC, contig length
Notice
- This part depends a lot on the files having short naming schemes and no extra symbols. So make sure that the final file has the 4 columns we would expect.
Comment if working with your own genomes:
If working with your own genomes, especially step 4 can be problematic. This step will ONLY work if your fasta header includes the binID and the contig ID separated by a `-`. If you do not have both pieces of information, have extra symbols, or if that symbol occurs somewhere else in the filename/header, this part of the workflow will very likely not work (without adjustments).
In this case it is easiest to skip this step and do the following:
(A) generate the file FileLists/1_Bins_to_protein_list.txt in the section Prepare list that links proteins with binIDs
(B) continue with the step Do the annotations
(C) during the first annotation step with the COG database, ignore the step #combine with previous data frame and instead run `cp NCBI_COGs/A_NCBI_COGs.tsv merge/temp_A.tsv` to generate the first table.
#make folder
mkdir contig_mapping
#for each contig list the gc and contig length (we use this later to control whether the merging of information works ok)
for sample in `cat FileLists/GenomeList`; do perl $perl_scripts/length+GC.pl fna/renamed/${sample}.fna > contig_mapping/${sample}_temp1; done
#add number of contigs as separate column
for sample in `cat FileLists/GenomeList`; do awk '$1=(FNR FS $1 FILENAME)' contig_mapping/${sample}_temp1 > contig_mapping/${sample}_temp2; done
#cleanup file: add in binIDs, so we have: contig id, binID, contig nr, contig gc and contig length
for sample in `cat FileLists/GenomeList`; do awk 'BEGIN{OFS="\t"}{split($2,a,"-"); print a[2],$1,$3,$4}' contig_mapping/${sample}_temp2 > contig_mapping/${sample}_temp3; done
for sample in `cat FileLists/GenomeList`; do awk 'BEGIN{OFS="\t"}{split($1,a,"/"); print a[1],a[2], $2,$3,$4}' contig_mapping/${sample}_temp3 | sed 's/contig_mapping//g' | sed 's/_temp1//g' > contig_mapping/${sample}_temp4; done
#combine results
cat contig_mapping/*temp4 > contig_mapping/Contig_Old_mapping.txt
#check file. header should have --> contigID, binID, contig NR, GC, contig length
head contig_mapping/Contig_Old_mapping.txt
#cleanup temp files
rm contig_mapping/*temp*
#create file for merging with prokka contigIDs. Here, we add the binID with the contig number together to recreate the prokka contig name
awk 'BEGIN{OFS="\t"}{print $2"_contig_"$3,$0}' contig_mapping/Contig_Old_mapping.txt > contig_mapping/Contig_Old_mapping_for_merging.txt
#check file: prokka contig name, prokka binID, binID, contig NR, GC, contig length
head contig_mapping/Contig_Old_mapping_for_merging.txt
In the file we see the contig name, contig GC and contig length. Since the two genomes are complete genomes we only work with two contigs in total. If you work with MAGs this table will get much longer.
Prepare list that links proteins with binIDs
Now, we want to generate a file that lists all proteins we are working with and a second column with the bin ID. So we want two columns: proteinID (accession) and binID.
#get list of protein accession numbers
grep "^>" faa/Prokka/All_Genomes_clean.faa > temp
#remove the prokka annotations from the file header
cut -f1 -d " " temp > temp2
#remove the ``>``
sed 's/>//g' temp2 > FileLists/AllProteins_list.txt
#check file; we want one column with the proteinIDs
head FileLists/AllProteins_list.txt
#remove temp files
rm temp*
#Modify protein list to add in a column with binID
awk -F'\t' -v OFS='\t' '{split($1,a,"-"); print $1, a[1]}' FileLists/AllProteins_list.txt | LC_ALL=C sort > FileLists/1_Bins_to_protein_list.txt
#check file content: we now have the accession and the binID
head FileLists/1_Bins_to_protein_list.txt
#check with how many proteins we work
wc -l FileLists/1_Bins_to_protein_list.txt
We now have a list with 2,069 proteins and the respective genome each protein belongs to. The first few lines of the file should look something like this:
Throughout the workflow (or whenever generating your own pipeline), check that the number of proteins stays consistent (see the sketch below).
Additionally, while running things for the first time it is always good to check the files we generate, to get a better feeling for what is happening at specific points in the script and to find potential bugs.
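A quick way to do such a consistency check is to compare the number of sequences in the concatenated protein file with the number of lines in the protein list (both should give 2,069 here):
#the number of proteins in the fasta file and in the protein list should be identical
grep -c ">" faa/Prokka/All_Genomes_clean.faa
wc -l FileLists/1_Bins_to_protein_list.txt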
Prepare a file with prokka annotations & Lengths & other genome info
Now, we want to generate a file that links the contig info with all the protein info. For this we need the gbk files: a .gbk file is a standard GenBank file generated by Prokka, derived from the master .gff. GFF stands for General Feature Format and is a master annotation file in GFF3 format, containing both sequences and annotations. Such files can be viewed directly in tools like the Integrative Genomics Viewer (IGV).
#compress gbk files (generated by prokka)
gzip faa/Prokka/*/*gbk
#make list for looping
for f in faa/Prokka/*/*gbk.gz; do echo $f >> FileLists/list_of_gzips; done
#control that the nr of files we have is correct: 2 genomes
wc -l FileLists/list_of_gzips
#create a summary file based on the gbk file
#lists the binname, contig name, contig length, proteinID, prokka annotation
python $python_scripts/Parse_prokka_for_MAGs_from_gbk-file_v2.py -i FileLists/list_of_gzips -t FileLists/Contig_Protein_list.txt
#check output of the script: Prokka binID, Prokka contig ID, contig length, gene start, gene stop, strand, accession, prokka annotation
head FileLists/Contig_Protein_list.txt
The output should look something like this
Next, we want to have a list that assigns the new ID prokka gives to a genome to the old genome name we got from NCBI:
#make prokka to binId mapping file
awk 'BEGIN{OFS="\t"}{split($1,a,"-"); print a[2],$2}' FileLists/1_Bins_to_protein_list.txt | sed 's/_[0-9]*\t/\t/g' | sort | uniq > FileLists/Prokka_to_BinID.txt
#check file: Prokka bin ID, old bin ID
head FileLists/Prokka_to_BinID.txt
Finally, we can merge everything together.
#link old to new binIDs
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' FileLists/Prokka_to_BinID.txt FileLists/Contig_Protein_list.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$10,$2,$2,$3,$4,$5,$6,$7,$8}' > temp_Bin_Contig_Protein_list.txt
#add in extra column for contig nr
awk -F'\t' -v OFS='\t' 'gsub("[A-Za-z0-9]*_","",$4)' temp_Bin_Contig_Protein_list.txt | awk -F'\t' -v OFS='\t' '{print $2"_contig_"$4,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' > FileLists/Bin_Contig_Protein_list.txt
#check file
head FileLists/Bin_Contig_Protein_list.txt
#merge with old contig IDs
#headers: accession, BinID, newContigID, oldContigID, mergeContigID,ContigLengthNew, LengthContigOld, GC,ProteinID, protein start, protein end, protein strand, prokka annotation
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' contig_mapping/Contig_Old_mapping_for_merging.txt FileLists/Bin_Contig_Protein_list.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $3"-"$7,$3,$4,$13,$1,$6,$17,$16,$7,$8,$9,$10,$11 }' > FileLists/Bin_Contig_Protein_list_merged.txt
#check file
head FileLists/Bin_Contig_Protein_list_merged.txt
#prepare a file with protein length and gc to add protein length into our file
perl $perl_scripts/length+GC.pl faa/Prokka/All_Genomes_clean.faa > temp
#merge with contig/protein info
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' temp FileLists/Bin_Contig_Protein_list_merged.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,$4,$6,$7,$8,$9,$10,$11,$12,$16,$13}' > temp2
#add a header
echo -e "accession\tBinID\\tNewContigID\tOldContigId\tContigNewLength\tContigOldLength\tGC\tProteinID\tProteinStart\tProteinEnd\tStrand\tProteinLength\tProkka" | cat - temp2 > FileLists/B_GenomeInfo.txt
#check file
less -S FileLists/B_GenomeInfo.txt
#control that all is ok
wc -l FileLists/B_GenomeInfo.txt
#2070
#remove temp files
rm temp*
The file stored in FileLists/B_GenomeInfo.txt
should look something like this:
Do the annotations
For this trimmed tutorial, we will only work with a subset of databases to reduce the computational time: COGs, arCOGs, KOs and the transporter database. If you have additional time, feel free to start other searches and compare the output to the databases mentioned above.
All searches follow a similar workflow:
- Run the search (with hmmer, diamond, blast, etc.)
- For each protein, make sure we have just one unique hit (usually we select the best hit for each protein using the best e-value or bitscore). I.e. a single protein might be similar to several dehydrogenases and we only keep the best hit.
- Clean up the output from the search.
- Merge the output with our protein list (that way we make sure we always have the same number of rows in all the files we generate).
- add gene descriptions to our table, i.e. when we do the search against the COGs we only have the COG IDs, but we want to add a gene description here.
- add an informative header.
For all the searches, we keep the original search output and the final parsed table; the other intermediate files we delete.
Once we have run our searches against the different databases, we merge all the results in a step-wise manner (using the protein IDs) and then, in a final step, clean the resulting data frame by removing redundant information.
We will add basic information on the different databases and search algorithms in the last section of this tutorial, so if you want more details check there.
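All the merge steps below reuse the same awk lookup idiom (FNR==NR{a[$1]=$0;next}{print $0,a[$1]}). A minimal, standalone sketch with two made-up tables shows what it does:
#while reading the first file (FNR==NR) store each line in an array keyed by column 1,
#then print every line of the second file followed by the stored line with the same key
printf "p1\tCOG0001\np2\tCOG0002\n" > lookup.tsv   #made-up annotation table
printf "p1\tgenomeA\np2\tgenomeB\n" > main.tsv     #made-up protein table
awk 'BEGIN{FS=OFS="\t"} FNR==NR{a[$1]=$0; next}{print $0, a[$1]}' lookup.tsv main.tsv
#expected output: p1  genomeA  p1  COG0001  (and the same for p2)
rm lookup.tsv main.tsv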
COG search
The COGs (Clusters of Orthologous Genes) database consists of orthologous gene clusters generated from 1,187 bacterial and 122 archaeal genomes (2020 version) and is therefore useful if you work with organisms from these two groups.
Possible settings to change:
- number of CPUs
- e-value threshold
#1. Setup directories
mkdir NCBI_COGs
mkdir merge
#2. run hmmsearch against all COGs
hmmsearch --tblout NCBI_COGs/sequence_results.txt -o NCBI_COGs/results_all.txt --domtblout NCBI_COGs/domain_results.txt --notextw --cpu $cpus $cog_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select hits above a certain e-value
sed 's/ \+ /\t/g' NCBI_COGs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > NCBI_COGs/sequence_results_red_e_cutoff.txt
#get best hit/protein based on bit score, and e-value
sort -t$'\t' -k3,3gr -k4,4g NCBI_COGs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > NCBI_COGs/All_NCBI_COGs_hmm.txt
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort NCBI_COGs/All_NCBI_COGs_hmm.txt) | LC_ALL=C sort > NCBI_COGs/temp4
#merge with COG mapping file
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.3,2.2,2.5,1.3 <(LC_ALL=C sort -k2 NCBI_COGs/temp4) <(LC_ALL=C sort -k1 $cog_mapping) | LC_ALL=C sort > NCBI_COGs/temp5
#add headers
echo -e "accession\tNCBI_COG\tNCBI_COG_Description\tPathwayID\tPathway\tNCBI_COG_evalue" | cat - NCBI_COGs/temp5 > NCBI_COGs/A_NCBI_COGs.tsv
#combine with previous data frame
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' NCBI_COGs/A_NCBI_COGs.tsv FileLists/B_GenomeInfo.txt > merge/temp_A.tsv
#check file
less -S merge/temp_A.tsv
#clean-up
rm NCBI_COGs/temp*
In step 2, hmmsearch generates multiple outputs (for details on the individual data found in these tables see chapter 5, page 45, of the HMMER manual):
- The --tblout option produces the per-target hits table. It contains one line for each query/target comparison that met the reporting thresholds, ranked by decreasing statistical significance (increasing E-value).
- The --domtblout option produces the domain hits table. There is one line for each domain, and there may be more than one domain per sequence.
- A file with the raw output, including alignments (-o option). Notice: this file can get rather large; if you have space issues, it is not essential and can be deleted.
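To get a feel for these formats, you can peek at the tables while skipping the comment lines, e.g.:
#have a quick look at the per-target and per-domain tables (comment lines start with #)
grep -v "^#" NCBI_COGs/sequence_results.txt | head -3
grep -v "^#" NCBI_COGs/domain_results.txt | head -3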
After parsing, the table stored in NCBI_COGs/A_NCBI_COGs.tsv
contains the protein accession, COG ID, COG description and e-value. If other relevant data is available, it is added as well; in the case of the COGs, the individual COGs are assigned to pathways, so this information is included:
Exercise
Compare the files sequence_results.txt and sequence_results_red_e_cutoff.txt. What is the difference between them?
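If you are unsure where to start, counting lines and looking at the first rows of both files is a good first step (a sketch):
#compare the unfiltered and the filtered hmmsearch results
wc -l NCBI_COGs/sequence_results.txt NCBI_COGs/sequence_results_red_e_cutoff.txt
head -3 NCBI_COGs/sequence_results_red_e_cutoff.txt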
ArCOGs search
The arCOGs are a database of orthologous groups built from archaeal genomes and are therefore especially useful when working with archaea.
Possible settings to change:
- number of CPUs (default = 2)
- e-value threshold (default = 1e-3)
#prepare folder
mkdir arcogs
#run hmmsearch
hmmsearch --tblout arcogs/sequence_results.txt -o arcogs/results_all.txt --domtblout arcogs/domain_results.txt --notextw --cpu $cpus $arcog_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' arcogs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > arcogs/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g arcogs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > arcogs/All_arcogs_hmm.txt
#separate header with arcog and ID
awk -F'\t' -v OFS='\t' '{split($2,a,"."); print $1, a[1], $3,$4}' arcogs/All_arcogs_hmm.txt | LC_ALL=C sort > arcogs/temp3
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort arcogs/temp3) | LC_ALL=C sort > arcogs/temp4
#merge with arcog mapping file
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.3,2.4,2.2,1.3 <(LC_ALL=C sort -k2 arcogs/temp4) <(LC_ALL=C sort -k1 $arcog_mapping) | LC_ALL=C sort > arcogs/temp5
#add in a header
echo -e "accession\tarcogs\tarcogs_geneID\tarcogs_Description\tPathway\tarcogs_evalue" | cat - arcogs/temp5 > arcogs/C_arcogs.tsv
#combine with previous dataframe
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' arcogs/C_arcogs.tsv merge/temp_A.tsv > merge/temp_BC.tsv
#check what happened
less -S merge/temp_BC.tsv
#clean-up
rm arcogs/temp*
Now our full table contains not only the COG information but also the arCOGs. This way we can go to every individual protein and check whether both databases hit the same protein.
In general it is useful to use more than one database to analyse your genome: no database is perfect, and having several databases to compare allows us to judge how confident we are in a certain annotation.
Unfortunately, there is no consistent naming convention. I.e. in the table above you can see for CALJACFG_00080 that the COG links this protein to a tRNA_A37_methylthiotransferase and the arCOG links it to a 2-methylthioadenine_synthetase, but despite the different names both refer to the same enzyme.
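To look such a protein up in the merged table yourself, a simple grep is enough (a sketch; the CALJACFG locus tag is just an example and will differ on your system):
#pull one protein of interest out of the merged table (adjust the locus tag to your own run)
grep "CALJACFG_00080" merge/temp_BC.tsv | less -S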
Question
We now have 3 annotations for each protein from prokka, the arcogs and the COGs.
Look at the gene with ID 00260 for the genome GCA_000008085 and check if all 3 databases give consistent annotations. If not, which annotation do you trust most?
Since Prokka generates different BinIDs (locus-tag prefixes) for each run, we cannot give you the exact accession, but using grep to search for the genome and protein ID should be enough to find the protein.
CALJACFG_00260 hits a sulfide-quinone reductase with Prokka and an NADH_dehydrogenase with both the arCOGs and the COGs. Generally, the e-values of the latter two databases are pretty good, while Prokka by default does not give us any score to judge the confidence of its hit. We should therefore be skeptical about the annotation as sulfide-quinone reductase.
KO search using Hmm’s
The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs and was generated using archaeal, bacterial and eukaryotic genomes.
Possible settings to change:
- number of CPUs
- e-value threshold
#setup directories
mkdir KO_hmm
#run KO search
hmmsearch --tblout KO_hmm/sequence_results.txt -o KO_hmm/results_all.txt --domtblout KO_hmm/domain_results.txt --notextw --cpu $cpus $ko_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' KO_hmm/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > KO_hmm/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g KO_hmm/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > KO_hmm/All_KO_hmm.txt
#merge with protein list
LC_ALL=C join -a1 -t $'\t' -j1 -e'-' -o 0,2.2,2.3,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort KO_hmm/All_KO_hmm.txt) | sort -u -k1 > KO_hmm/temp
#merge with KO_hmm names
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o 1.1,1.2,1.4,1.3,2.2,2.12 <(LC_ALL=C sort -k2 KO_hmm/temp) <(LC_ALL=C sort -k1 $ko_mapping) | LC_ALL=C sort > KO_hmm/temp3
#add in an extra column that lists whether hits have a high confidence score
awk -v OFS='\t' '{ if ($4 > $5){ $7="high_score" }else{ $7="-" } print } ' KO_hmm/temp3 > KO_hmm/temp4
#add header
echo -e "accession\tKO_hmm\te_value\tbit_score\tbit_score_cutoff\tDefinition\tconfidence" | cat - KO_hmm/temp4 > KO_hmm/M_KO_hmm.tsv
#combine with previous data frame
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' KO_hmm/M_KO_hmm.tsv merge/temp_BC.tsv > merge/temp_BCM.tsv
#control lines
wc -l merge/*
#clean up
rm KO_hmm/temp*
To keep this workflow short, we will NOT continue with the following searches (Pfam, TIGR, CAZymes) but instead jump ahead to the Transporter DB search. If you want to run the full workflow, you will have to adjust the steps that have the comment `#combine with previous data frame` accordingly.
Pfam search
The Pfam database is a large collection of protein families generated from UniProt Reference Proteomes; it includes archaea, bacteria and eukaryotes.
Possible settings to change:
- number of CPUs
- e-value threshold
#1. Setup directories
mkdir pfam
#2. run hmmsearch against all pfams
hmmsearch --tblout pfam/sequence_results.txt -o pfam/results_all.txt --domtblout pfam/domain_results.txt --notextw --cpu $cpus $pfam_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' pfam/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $4, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > pfam/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g pfam/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > pfam/All_pfam.txt
#clean pfam id
awk -F'\t' -v OFS='\t' '{split($2,a,"."); print $1,a[1], $3, $4}' pfam/All_pfam.txt > pfam/temp
#merge with protein list
LC_ALL=C join -e'-' -a1 -t $'\t' -j1 -o 0,2.2,2.3,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort pfam/temp) | sort -u -k1 > pfam/temp3
#merge with pfam names
LC_ALL=C join -a1 -1 2 -2 1 -t $'\t' -e'-' -o1.1,0,2.5,1.4 <(LC_ALL=C sort -k2 pfam/temp3) <(LC_ALL=C sort -k1 $pfam_mapping) | LC_ALL=C sort > pfam/temp5
#add in header
echo -e "accession\tPFAM_hmm\tPFAM_description\tPfam_Evalue" | cat - pfam/temp5 > pfam/E_Pfam.tsv
#combine with previous dataframe
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' pfam/E_Pfam.tsv merge/temp_BCM.tsv > merge/temp_BCME.tsv
#control lines
wc -l merge/*
#clean up
rm pfam/temp*
TIGR search
The original TIGRFAMs database was a research project of The Institute for Genomic Research (TIGR) and its successor, the J. Craig Venter Institute (JCVI); the profiles are now maintained by NCBI. TIGRFAMs is a collection of manually curated protein families focusing primarily on prokaryotic sequences.
Possible settings to change:
- number of CPUs
- e-value threshold
#1. Setup directories
mkdir TIGRs
#2. run hmmsearch against all tigrs
hmmsearch --tblout TIGRs/sequence_results.txt -o TIGRs/results_all.txt --domtblout TIGRs/domain_results.txt --notextw --cpu $cpus $tigr_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' TIGRs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $4, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > TIGRs/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g TIGRs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > TIGRs/All_TIGR.txt
#merge with protein list
LC_ALL=C join -a1 -e'-' -t $'\t' -j1 -o 0,1.2,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort TIGRs/All_TIGR.txt) | sort -u -k1 > TIGRs/temp
#merge with tigr names
LC_ALL=C join -a1 -1 3 -2 1 -e'-' -t $'\t' -o1.1,0,2.11,2.13,1.4 <(LC_ALL=C sort -k3 TIGRs/temp) <(LC_ALL=C sort -k1 $tigr_mapping)| LC_ALL=C sort > TIGRs/temp3
#add header
echo -e "accession\tTIRGR\tTIGR_description\tEC\tTIGR_Evalue" | cat - TIGRs/temp3 > TIGRs/F_TIGR.tsv
#combine with previous data frame
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' TIGRs/F_TIGR.tsv merge/temp_BCME.tsv > merge/temp_BCMEF.tsv
#control lines
wc -l merge/*
#clean up
rm TIGRs/temp*
CazyDB search
The CAZy (Carbohydrate-Active enZYmes) database describes the families of structurally-related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds. As such, this database does not consist of clusters of orthologous groups but individual protein sequences that can be linked to CAZys.
Possible settings to change:
- number of CPUs
- e-value threshold (default = 1e-10). We are a bit stricter here because the database is smaller, so the chance of false positives is higher.
#make folder
mkdir CAZYmes
#2. run hmmsearch against the CAZy database
hmmsearch --tblout CAZYmes/sequence_results.txt -o CAZYmes/results_all.txt --domtblout CAZYmes/domain_results.txt --notextw --cpu $cpus $cazy_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' CAZYmes/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-10' > CAZYmes/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g CAZYmes/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > CAZYmes/All_vs_CAZYmes.txt
#clean things and remove .hmm from the file
sed s'/\.hmm//g' CAZYmes/All_vs_CAZYmes.txt > CAZYmes/temp
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort CAZYmes/temp) | LC_ALL=C sort > CAZYmes/temp3
#merge with mapping file
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o 1.1,1.2,1.3,2.2 <(LC_ALL=C sort -k2 CAZYmes/temp3) <(LC_ALL=C sort $cazy_mapping) | LC_ALL=C sort > CAZYmes/temp4
#add in headers
echo -e "accession\tCAZy\tCAZy_evalue\tDescription" | cat - CAZYmes/temp4 > CAZYmes/G_CAZy.tsv
#combine with previous dataframe
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' CAZYmes/G_CAZy.tsv merge/temp_BCMEF.tsv > merge/temp_BCMEFG.tsv
#control lines
wc -l merge/*
#clean up
rm CAZYmes/temp*
Transporter DB search
After the KO search, we continue here. If you ran the other database searches as well, you need to change temp_BCM.tsv in the part #combine with previous dataframe accordingly. Afterwards, we will NOT run the IPRscan and the diamond search but continue at Modify final dataframe
The transporter classification database (also tcdb) details a comprehensive classification system for membrane transport proteins known as the Transporter Classification (TC) system. The database consists of individual protein sequences linked to this classification system.
Possible settings to change:
- number of CPUs (default setting for -num_threads = 2)
- e-value threshold (default setting for -evalue = 1e-10). We are a bit stricter here because the database is smaller, so the chance of false positives is higher.
#make folder
mkdir TransporterDB
#2. run protein blast against all TransporterDB
blastp -num_threads $cpus -outfmt 6 -query faa/Prokka/All_Genomes_clean.faa -db $transporter_db -out TransporterDB/All_vs_TPDB.tsv -evalue 1e-10
#get best hit based on bit score, and then e-value
sort -t$'\t' -k12,12gr -k11,11g TransporterDB/All_vs_TPDB.tsv | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k12,12gr -k11,11g > TransporterDB/temp
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.11 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort TransporterDB/temp) | LC_ALL=C sort > TransporterDB/temp3
#merge with mapping file
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o 1.1,1.2,1.3,2.3 <(LC_ALL=C sort -k2 TransporterDB/temp3) <(LC_ALL=C sort -k1 $transporter_mapping) | LC_ALL=C sort > TransporterDB/temp4
#add in headers
echo -e "accession\tTCBD_ID\tTCBD_evalue\tTCBD_description" | cat - TransporterDB/temp4 > TransporterDB/I_TPDB.tsv
#combine with previous dataframe
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' TransporterDB/I_TPDB.tsv merge/temp_BCM.tsv > merge/temp_BCMEFGHI.tsv
#control lines
wc -l merge/*
#clean up
rm TransporterDB/temp*
HydDB search
The hydrogenase database (HydDB) contains individual protein sequences of hydrogenases and is a curated database in which hydrogenases are classified by type.
Possible settings to change:
- number of CPUs (default setting for -num_threads = 2)
- e-value threshold (default setting for -evalue = 1e-10). We are a bit stricter here because the database is smaller, so the chance of false positives is higher.
#make folder
mkdir HydDB
#run search
blastp -num_threads $cpus -outfmt 6 -query faa/Prokka/All_Genomes_clean.faa -db $hyd_db -out HydDB/All_vs_HydDB.tsv -evalue 1e-10
#get best hit based on bit score, and then e-value
sort -t$'\t' -k12,12gr -k11,11g HydDB/All_vs_HydDB.tsv | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k12,12gr -k11,11g > HydDB/temp
#merge with contig list; control: grep for 05160
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.3,2.11,2.12 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort HydDB/temp) | LC_ALL=C sort > HydDB/temp2
#merge with HydDB names
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.2,1.4 <(LC_ALL=C sort -k2 HydDB/temp2) <(LC_ALL=C sort -k1 $hyd_mapping) | LC_ALL=C sort > HydDB/temp3
#add in headers
echo -e "accession\tHydDB\tDescription\tHydDB_evalue" | cat - HydDB/temp3 > HydDB/J_HydDB.tsv
#combine with previous dataframe
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' HydDB/J_HydDB.tsv merge/temp_BCMEFGHI.tsv > merge/temp_BCMEFGHIJ.tsv
#control lines
wc -l merge/*
#clean up
rm HydDB/temp*
IPRscan --> done
InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites, such as transmembrane regions (TMs) or signal peptides.
Both the IPRscan and the diamond search against NCBI_nr will take quite some time, so have patience if you run these on your own.
genome cleanup
IPRscan does not like to deal with `*` in the protein files, which is used for stop codons. If we work with programs that do not recognize `*`, we should remove this symbol first.
#interproscan does not like if we have stars in our protein files, so we need to remove them
cd faa/Prokka
sed 's/*//g' All_Genomes_clean.faa > All_Genomes_clean_interproscan.faa
cd ../..
Before running IPRscan, we can split our big concatenated file into several smaller files. Then we can use GNU parallel to speed things up.
We set the number of parallel processes with `-j`. Always make sure that you adjust this depending on the available resources.
run iprscan
Possible settings to change:
- Number of files we split the concatenated fasta file into via Split_Multifasta.py (default of -n = each split file will contain 1000 sequences)
- number of parallel jobs to run via parallel (default setting for -j= 4)
- e-value threshold (default setting for -evalue = 1e-20)
mkdir IPRscan
#split files --> 2069 proteins were split into 3 files
python $python_scripts/Split_Multifasta.py -m faa/Prokka/All_Genomes_clean_interproscan.faa -n 1000
#move files into a better location
mkdir split_faa
mv File*.faa split_faa
#iprscan: (interproscan-5.31-70.0)
#"cite:Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47."
parallel -j3 'i={}; interproscan.sh -i $i -d IPRscan/ -appl TIGRFAM,SFLD,SUPERFAMILY,Gene3D,Hamap,Coils,ProSiteProfiles,SMART,CDD,PRINTS,ProSitePatterns,Pfam,ProDom,MobiDBLite,PIRSF,TMHMM -T IPRscan/temp --iprlookup --goterms -cpu 5' ::: split_faa/File*.faa
#if parallel is not available on your system, use a for-loop
#for sample in split_faa/File*.faa; do interproscan.sh -i $sample -d IPRscan/ -appl TIGRFAM,SFLD,SUPERFAMILY,Gene3D,Hamap,Coils,ProSiteProfiles,SMART,CDD,PRINTS,ProSitePatterns,Pfam,ProDom,MobiDBLite,PIRSF,TMHMM -T IPRscan/temp --iprlookup --goterms -cpu 5; done
#Concat result files
cd IPRscan
mkdir single_files
mv File* single_files
mkdir Concat_results/
cat single_files/File*.faa.xml > Concat_results/All_bins_iprscan-results.xml
cat single_files/File*.faa.gff3 > Concat_results/All_bins_bins_iprscan-results.gff3
cat single_files/File*.faa.tsv > Concat_results/All_bins_bins_iprscan-results.tsv
#cleanup
rm single_files/File*
cd ..
#clean up fasta header so that it is exactly the same as the accession ID in the interproscan results
python $python_scripts/parse_IPRdomains_vs2_GO_2_ts_sigP.py -s faa/Prokka/All_Genomes_clean.faa -i IPRscan/Concat_results/All_bins_bins_iprscan-results.tsv -o IPRscan/All_bins_bins_iprscan-results_parsed.tsv
#remove issue with spaces
sed 's/ /_/g' IPRscan/All_bins_bins_iprscan-results_parsed.tsv | LC_ALL=C sort > IPRscan/temp.txt
#print only columns of interest
awk -F'\t' -v OFS="\t" '{print $1, $2,$3,$4, $5, $8, $9}' IPRscan/temp.txt > IPRscan/temp2
#add header
echo -e "accession\tIPR\tIPRdescription\tIPR_PFAM\tIPR_PFAMdescription\tTMHMM\tSignalP" | cat - IPRscan/temp2 > IPRscan/K_IPR.tsv
#combine with previous data frame
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' IPRscan/K_IPR.tsv merge/temp_BCMEFGHIJ.tsv > merge/temp_BCMEFGHIJK.tsv
#control lines
wc -l merge/*
#clean up
rm IPRscan/temp*
Diamond against NCBI NR
Possible settings to change:
- number of CPUs (–threads)
- e-value threshold (–evalue)
- Sequence numbers to display (–seq)
Notice:
- This step can take quite some time.
- the NCBI nr database is constantly updated (and very large), which is why we are not providing the database with the other files. For information on how to download ncbi_nr check out the information here.
#make folder
mkdir NCBI_NR
#run diamond against NR database
diamond blastp -q faa/Prokka/All_Genomes_clean.faa --more-sensitive --evalue 1e-3 --threads $cpus --seq 50 --db $ncbi_nr_db --taxonmap $taxon_db --outfmt 6 qseqid qtitle qlen sseqid salltitles slen qstart qend sstart send evalue bitscore length pident staxids -o NCBI_NR/All_NCBInr.tsv
#Select columns of interest in diamond output file
awk -F'\t' -v OFS="\t" '{ print $1, $5, $6, $11, $12, $14, $15 }' NCBI_NR/All_NCBInr.tsv | sed 's/ /_/g' > NCBI_NR/temp
#Parse Diamond results
python $python_scripts/parse_diamond_blast_results_id_taxid.py -i FileLists/AllProteins_list.txt -d NCBI_NR/temp -o NCBI_NR/temp2
#rm header
sed 1d NCBI_NR/temp2 > NCBI_NR/temp3
#replace any remaining spaces with underscores in every second column
awk -F"\t" '{for(i=2;i<=NF;i+=2)gsub(/[[:blank:]]/,"_",$i)}1' OFS="\t" NCBI_NR/temp3 > NCBI_NR/temp4
#the python script above sometimes leaves an empty 7th column, this gets rid of that issue
awk -F'\t' -v OFS="\t" '{if (!$7) {print $1,$2, $4 , $6, "-"} else {print $1, $2, $4, $6, $7}}' NCBI_NR/temp4 | LC_ALL=C sort > NCBI_NR/temp5
#split columns with two tax ids
awk -F'\t' -v OFS='\t' '{split($5,a,";"); print $1, $2, $3, $4, a[1]}' NCBI_NR/temp5 > NCBI_NR/temp6
#in column 2 remove everything after < (otherwise the name can get too long)
awk -F'\t' -v OFS='\t' '{split($2,a,"<"); print $1, a[1], $3, $4, $5}' NCBI_NR/temp6 > NCBI_NR/temp6a
#merge with tax names
LC_ALL=C join -a1 -1 5 -2 1 -e'-' -t $'\t' -o1.1,1.2,1.3,1.4,1.5,2.2 <(LC_ALL=C sort -k5 NCBI_NR/temp6a) <(LC_ALL=C sort -k1 ~/../spang_team/Databases/NCBI_taxonomy/Jul2019/taxonomy5.txt ) | LC_ALL=C sort > NCBI_NR/temp7
#add in header
echo -e "accession\tTopHit\tE_value\tPecID\tTaxID\tTaxString" | cat - NCBI_NR/temp7 > NCBI_NR/L_Diamond.tsv
#combine with previous dataframe
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' NCBI_NR/L_Diamond.tsv merge/temp_BCMEFGHIJK.tsv > merge/temp_BCMEFGHIJKL.tsv
#control lines
wc -l merge/*
#clean up
rm NCBI_NR/temp*
Modify final dataframe
Now, let's clean up the final dataframe and then we are done!
Basically, we do several cosmetic steps. I.e. each database search added another column with the accession ID, so in a second step we remove these duplicated columns to slim the data down.
In this trimmed workflow, we use the file merge/temp_BCMEFGHI.tsv from the transporter database search. If you ran the full workflow, you can replace this file with merge/temp_BCMEFGHIJKL.tsv in the first line of code.
#duplicate first column and add new header for this column
awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$0}' merge/temp_BCMEFGHI.tsv | sed '1s/accession/for_merge/' > merge/temp_0.tsv
#rm redundant headers and reconstruct accession ID
awk 'BEGIN{FS="\t";OFS="\t"}NR==1{for(x=1;x<=NF;x++)if($x!="accession")l[x]++;}{for(i=1;i<=NF;i++)if(i in l)printf (i==NF)?$i"":$i"\t";printf "\n"}' merge/temp_0.tsv > merge/temp_1.tsv
#remove unknown symbols (these symbols might come from some of the mapping files)
tr -cd '\11\12\15\40-\176' < merge/temp_1.tsv > merge/temp_2.tsv
#change column name
awk 'BEGIN{FS="\t"; OFS="\t"}{if(NR==1) $1="accession"} {print $0 }' merge/temp_2.tsv > merge/Annotations.txt
#clean up
rm merge/temp_[0-9].tsv
gzip merge/*tsv
Question/Homework -> Let's figure out what Ignicoccus is doing
- Download the table Annotations.txt to your personal computer using scp.
- Compare the different annotations, have a look at how they differ.
- Make one separate table for each of the two genomes.
- For Ignicoccus (GCA_000017945), copy the accession numbers and KO ids into a new file and upload or copy and paste them into the KEGG mapper. Use this tool to get some first insights into the metabolic capacities of that organism. (If you are unsure how to handle the files, see the command-line sketch after this list.)
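Below is a minimal command-line sketch for the file handling. The server path, the column numbers and the assumption that the bin/genome ID sits in column 2 and the KO id in column 10 are placeholders, so check the header of your own Annotations.txt first:
#copy the table from the server to your own computer (run this locally; adjust user, host and path)
scp user@server:/path/to/merge/Annotations.txt .
#check the header to see which columns hold the bin/genome ID and the KO id
head -1 Annotations.txt | tr '\t' '\n' | cat -n
#split the table into one file per genome, assuming the bin ID is in column 2
awk -F'\t' 'NR>1{print > ($2"_annotations.tsv")}' Annotations.txt
#extract the accession and KO id for Ignicoccus, assuming the KO id is in column 10
awk -F'\t' -v OFS='\t' '$2=="GCA_000017945" && $10!="-"{print $1,$10}' Annotations.txt > Igni_KOs.txt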
Based on this try to answer:
Does Igni use glycolysis or gluconeogenesis (or both)?
This question allows us to figure out whether our organism might use carbon sources like sugars (via glycolysis; if we find this we might look further into whether we have a fermenting organism) or uses acetyl-CoA to make its own sugars, which could then be used in other pathways (gluconeogenesis).
Is Igni likely an anaerobic or aerobic organism?
This question can help us decide what growth conditions we need and whether O2 is required for our organism to generate energy.
Is Igni an autotroph or a heterotroph?
Again, this question is important if we want to know how to grow our organism: do we need to feed it sugars so that it grows, or can it grow on CO2? If it is involved in CO2 and/or methane cycling it might also have an impact on global carbon cycles.
Can Igni reduce sulfur?
If Igni grows on sulfur, we could add sulfur to a potential growth medium. Plus, for the environmental side of things we want to look into lithotrophy and how organisms in our environment are involved in nitrogen, sulfur, metal or hydrogen cycling.
If Igni reduces sulfur, what does it oxidize?
Same as above.
The input for the KEGG mapper website should look something like this:
Answer for Does Igni use glycolysis or gluconeogenesis?
Answer for Is Igni likely an anaerobic or aerobic organism?
Answer for Is Igni an autotroph or a heterotroph?
Answer for Can Igni reduce sulfur?
Answer for If Igni reduces sulfur, what does it oxidize?
What next?
Depending on the number of genomes we have, the final table will get rather large. Therefore, it makes a lot of sense to condense the information. The following ideas will not be part of this tutorial but will give you some pointers on how to do that:
R magic
We provide an R tutorial here. This tutorial starts with the basics of R but ends with an example of how to deal with the annotation tables generated here. More specifically, it will show you how to:
- Make count tables for each genome, i.e. for MAG1 count how often we find arCOG0001. This calculation is done for each genome that is part of your analysis and thus provides you with a more condensed table (a quick command-line sketch of such a count table is shown after this list).
- Condense the data on taxonomic or custom levels. E.g. the example in the R tutorial uses ~30 genomes of the same phylum that come from different environments. Using the tutorial you will be able to ask: how many of the aquifer genomes have arCOG0001? How many of the marine genomes have this gene?
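As a quick command-line preview of what such a count table looks like, the following is only a sketch; it assumes the bin ID is in column 2 and the arCOG id in column 3 of Annotations.txt, which you need to adjust to your actual table:
#count how often each arCOG occurs in each genome
awk -F'\t' -v OFS='\t' 'NR>1 && $3!="-"{count[$2 OFS $3]++}END{for(k in count)print k,count[k]}' Annotations.txt | sort > arCOG_counts_per_genome.tsv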
Python magic
This section will be extended once the author of this tutorial has time. However, once you have to deal with 5000 genomes, python becomes much more attractive for the parsing, especially compared to R/RStudio.
Background info on the used tools and databases
Here, you will find some background info on the approaches, databases and tools used. The idea is to provide links to the manuals and some other information that might (or might not) be useful.
Note, this is an on-going part of this document, so it might change over time.
Gene predictions
There are several tools to identify genes in our genomes:
- Prodigal: a prokaryotic gene recognition and translation initiation site identification tool. Hyatt et al., 2010
- Prokka: A rapid prokaryotic genome annotation tool. Seemann 2014. This tool uses prodigal for gene calling but additionally does some basic genome annotation.
Something important to keep in mind is that these gene prediction tools are generally used for genomes from archaea and bacteria. If you are interested in eukaryotic genomes, things like splicing and alternative reading frames make gene prediction trickier, and there are special tools out there that are better suited for eukaryotes.
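For reference, a minimal prodigal run could look like the sketch below; the file names are placeholders, and in this tutorial we use prokka instead, which runs prodigal under the hood and adds some basic annotation:
#predict genes and proteins for a single genome (use -p meta for metagenomes)
prodigal -i genome.fna -a genome_proteins.faa -d genome_genes.fna -o genome_genes.gbk -p single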
Information on the different search tools
BLAST
BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between protein or nucleotide sequences (Altschul 1990).
NCBI provides a quick tutorial explaining its general usage, and you can find some more details on the algorithm and the usage of the website here.
There are different versions of blast that you can use depending on what comparisons you want to make.
- blastn = nucleotide blast searches with a nucleotide “query” against a database of nucleotide “subject” sequences.
- blastp = protein blast searches with a protein “query” against a database of protein “subject” sequences.
- translated blast searches use a genetic code to translate either the "query," database "subjects," or both, into protein sequences, which are then aligned as in "blastp."
- tblastn = a protein sequence query is compared to the six-frame translations of the sequences in a nucleotide database.
- blastx = a nucleotide sequence query is translated in six reading frames, and the resulting six protein sequences are compared, in turn, to those in a protein sequence database.
- tblastx = both the “query” and database “subject” nucleotide sequences are translated in six reading frames, after which 36 (6 × 6) protein “blastp” comparisons are made.
Additionally, we can work with profiles using psi-blast = a protein sequence profile search method that builds on the alignments generated by a run of the blastp program.
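As an illustration (not part of this workflow), a minimal blastp search on the command line could look like this, assuming BLAST+ is installed and using placeholder file names:
#build a protein blast database from a fasta file
makeblastdb -in reference_proteins.faa -dbtype prot -out reference_db
#run blastp and write the hits as a tab-separated table
blastp -query query_proteins.faa -db reference_db -evalue 1e-5 -outfmt 6 -num_threads 4 -out blastp_results.tsv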
Diamond
Diamond is an alternative to BLAST for protein and translated DNA searches (Buchfink et al., 2015). If you check the diamond website you will find some advantages over blast listed:
- Pairwise alignment of proteins and translated DNA at 100x-20,000x speed of BLAST.
- Frameshift alignments for long read analysis.
- Low resource requirements and suitable for running on standard desktops or laptops.
- Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.
In principle diamond works very similarly to blast and uses the same input files.
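To illustrate this similarity (again not part of this workflow), a diamond search equivalent to the blastp sketch above could look like this, with placeholder file names:
#build a diamond database from the same protein fasta file
diamond makedb --in reference_proteins.faa --db reference_db
#run the search; --outfmt 6 gives the same tabular output as blast
diamond blastp --query query_proteins.faa --db reference_db --evalue 1e-5 --outfmt 6 --threads 4 --out diamond_results.tsv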
HMMER
We also use HMMER for searching databases and generating alignments (but generally we prefer to use mafft-linsi for alignments ourselves).
The approach taken by HMMER is quite different from blast and diamond as HMMER implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
If you look at one of the HMM databases used in this tutorial, e.g. with less arcog.hmm, you will see that the files look very different compared to what we use as input for blast and diamond. One of the biggest differences is that the profiles used in HMMER searches are built from multiple sequence alignments.
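As a minimal sketch of such a search (the full workflow above uses its own settings; e-value cutoff and CPU number here are placeholders), all proteins can be searched against a profile database like this:
#search all proteins against the arCOG profile HMMs and write the hits to a parseable table
hmmsearch --tblout arcog_hits.tsv -E 1e-5 --cpu 4 arcog.hmm faa/Prokka/All_Genomes_clean.faa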
For more information you can check the HMMER website or the hmmer paper
Information on the different databases
COGs
COG = Clusters of Orthologous Groups.
The COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages.
The different COGs are assigned to functional categories and pathways, and the pathway descriptions can be found here
Some papers you might want to check if you want to know more:
arCOGs
arCOG = Archaeal Clusters of Orthologous Genes. arCOGs are generated in a similar manner as the COGs, just with a focus on archaeal genomes.
Some papers you might want to check if you want to know more:
KOs and KEGG
KO = KEGG ORTHOLOGY
The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs. We assign K numbers to our proteins by running HMMER/hmmsearch with the profile HMMs from the KOfam database.
Details on how the HMMs and cutoffs were generated can be found here
Something useful to know is that all KOs, e.g. K00161, are linked to metabolic maps that have quite good descriptions. For example, if you google K00161 you will come across this website, which gives:
- a detailed description of what gene we have
- if applicable the enzyme number (EC number)
- The pathways this gene can be involved in. If you click on the pathway you see its exact position in a pathway
- A paper that first describes the gene
If you click on the enzyme number (EC 1.2.4.1) you will get other useful information
- alternative names
- the “chemical” class it belongs to
- the chemical reaction
- useful comments about the reaction and whether it is part of an enzyme complex
- if it is an enzyme complex, we get the KOs for the other subunits
Biocyc: Alternative to KEGG to see how a gene is linked to a pathway
If you want to find out more about a gene, you can also check MetaCyc. I.e. if we search for "pyruvate dehydrogenase" biocyc in Google we should land here
Similar to KEGG, we get some basic information about this enzyme, such as:
- the different subunits
- a basic description of what that enzyme does
- the reaction, incl. the direction
- Functions of the other subunits
- some paper references
PFAM
PFAM is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
TIGRFAMs
The TIGRFAMs database is a resource consisting of curated multiple sequence alignments, Hidden Markov Models (HMMs) for protein sequence classification, and associated information designed to support automated annotation of (mostly prokaryotic) proteins.
CAZy
CAZy = Carbohydrate-Active enZYmes
A database dedicated to the display and analysis of genomic, structural and biochemical information on Carbohydrate-Active Enzymes (CAZymes).
Resources:
Transporter DB
This is the website that gives more information about potential transporters found in our search.
HydDB
HydDB = Hydrogenase database
Resources:
- The original paper
- The website we originally got the sequences from. It also provides more detailed information on the different hydrogenase subcategories.
IPRscan
Short for InterProScan
InterPro allows for the classification of protein families and the prediction of domains and important sites.
If we see an ID like IPR000013, this is a so-called InterPro domain that might be found on your protein. Notice that a protein can have more than one domain.
Resources:
- The website from which to download all information
- The original paper, which provides also an overview of the different databases integrated with InterPro.
NCBI-nr
As we have seen above, NCBI (the National Center for Biotechnology Information) provides a lot of tools, such as blast, and databases, such as the COGs and arCOGs. Additionally, it is one of the largest repositories of sequences from metagenomes, amplicon libraries and genomes.
The nr database (here: ncbi_nr) is compiled by the NCBI as a protein database for blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF and is frequently updated. However, it is quite a large database, making searches more time-consuming.
Uniprot - checking your results
Another good database, next to KEGG and metacyc, to check for information for proteins is Uniprot. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
We can search for:
- protein names, i.e. “pyruvate dehydrogenase”
- proteins from certain taxa using the taxonomy option, i.e.
taxonomy:bacteria pyruvate dehydrogenase
- KOs, arCOGs, etc …
I.e. if we search for taxonomy:bacteria "pyruvate dehydrogenase e1"
we should get:
- A list of all proteins classified as E1 subunit of the pyruvate dehydrogenase in bacteria
- some of these sequences are reviewed (i.e. might have some more supporting info) and others are unreviewed (annotation only based on sequence similarity)
- a list of taxa that have this protein (under "View by", click on "Taxonomy") and then we can see exactly in which bacteria this enzyme is found
If we click on an individual example, e.g. from E. coli, we get:
- Details on the reaction
- Publications about this specific enzyme in E.coli
- Co-factors
- Information about potential functions and biological processes
- a reference to MetaCyc
- The links to several databases (i.e. protein databases, KEGG, IPR domains, etc.)