# Phylogenetic tree for the tRNA project (Thermococcus set)

We generate a tree from the set of complete Thermococcus genomes that was used for tRNA studies before we joined the project, complemented by 3 genomes as an outgroup. Note: we ran this analysis before with the same set of Thermococcus genomes but with different genomes as an outgroup. We changed the outgroup members to be more closely related taxonomically, to rule out tree artefacts.



## Set working environment

Note: if you run this yourself, adjust the paths accordingly.

```bash
wdir="/path/to_wdir"

#path to database to find markers of interest (available here: https://zenodo.org/record/3839790#.YsPgSHBBxcA)
hmm_db="/path/52_tested_markers/52tested_inclTIGR.hmm"

#list of markers used for phylogenetic analyses
undin_markers="/path//52_markers_list"

#path to concatenated genome files
faa_path="/path/All_Genomes.faa"

cd "$wdir"
```
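
Because every later step depends on these paths, it can help to verify them before starting. The following is a minimal sketch, not part of the original workflow: `check_inputs` is a hypothetical helper, and the demonstration uses temporary stand-in files rather than the real database.

```bash
# Sketch: report required inputs that do not exist.
# check_inputs is a hypothetical helper; it returns the number of missing paths.
check_inputs() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Toy demonstration with temporary stand-ins for the real paths
demo_dir=$(mktemp -d)
touch "$demo_dir/markers.hmm"
n_missing=0
check_inputs "$demo_dir/markers.hmm" "$demo_dir/does_not_exist" 2>/dev/null || n_missing=$?
echo "missing inputs: $n_missing"
rm -r "$demo_dir"
```

In the real workflow the call would look like `check_inputs "$wdir" "$hmm_db" "$undin_markers"`.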



## Prepare file names


```bash
mkdir faa

#get thermococcus from the first version
cp ../v1/faa/single/GCF_00* faa/

#rm old outgroup genomes
rm faa/GCF_000017165.1.faa
rm faa/GCF_000214415.1.faa
rm faa/GCF_000091665.1.faa

#get three new outgroup genomes
cp ../../faa/single/GCF_000017165.1.faa faa/
cp ../../faa/single/GCF_002945325.1.faa faa/
cp ../../faa/single/GCF_000017185.1.faa faa/

#check the number of genomes we work with --> 23
ls faa/*faa | wc -l

#stay in $wdir: all following paths are relative to it

#cleanup
mkdir faa/single
cp faa/*faa faa/single

#combine data
cat faa/single/*faa > faa/All_Genomes.faa
```
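
Later steps split the FASTA headers at the first `-` to separate the genome ID from the protein ID, so it is worth checking that every header follows this layout. The sketch below is an illustration only: `count_bad_headers` is a hypothetical helper and the accessions in the toy file are made up.

```bash
# Sketch: count FASTA headers without a "-" between genome ID and protein ID.
# count_bad_headers is a hypothetical helper; the accessions below are invented.
count_bad_headers() {
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    grep '^>' "$1" | grep -vc -- '-' || true
}

demo=$(mktemp)
printf '>GCF_000000001.1-WP_000000001.1\nMKV\n>GCF_000000002.1_WP_000000002.1\nMKL\n' > "$demo"
n_bad=$(count_bad_headers "$demo")
echo "headers without separator: $n_bad"
rm "$demo"
```

For the real data the call would be `count_bad_headers faa/All_Genomes.faa`, which should report 0 if all headers are formatted as expected.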


## Search for marker proteins by first running hmmsearch


```bash
mkdir Hmmersearch

#run search (output from this search is provided)
hmmsearch --tblout Hmmersearch/sequence_results.txt -o Hmmersearch/results_all.txt --domtblout Hmmersearch/domain_results.txt --notextw --cpu 20 "$hmm_db" faa/All_Genomes.faa

#format the full table and only keep hits with an e-value of at most 1e-3
sed 's/ \+ /\t/g' Hmmersearch/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3'  > Hmmersearch/sequence_results_red_e_cutoff.txt

#get best hit based on bit score, and then evalue (in case two sequences have the same bitscore)
sort -t$'\t' -k3,3gr -k4,4g Hmmersearch/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1  | sort -t$'\t' -k3,3gr -k4,4g >  Hmmersearch/All_NCBI_COGs_hmm.txt
```
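
To make the best-hit selection easier to follow, here is the same three-step `sort` chain applied to a small invented table (protein, marker, bit score, e-value). The protein IDs and scores are toy values, not results from this analysis.

```bash
# Sketch: best hit per protein on toy data (columns: protein, marker, bitscore, evalue).
export LC_ALL=C
demo=$(mktemp)
printf 'G1-p1\tTIGR00037\t50.0\t1e-10\n'  >  "$demo"
printf 'G1-p1\tTIGR00064\t80.0\t1e-20\n'  >> "$demo"
printf 'G1-p2\tTIGR00037\t120.5\t1e-30\n' >> "$demo"
printf 'G2-p9\tTIGR00064\t80.0\t1e-25\n'  >> "$demo"

# 1) best score first (ties broken by lowest e-value)
# 2) stable unique on the protein ID keeps the first, i.e. best, line per protein
# 3) restore the score ordering
best_hits=$(sort -t$'\t' -k3,3gr -k4,4g "$demo" \
    | sort -t$'\t' --stable -u -k1,1 \
    | sort -t$'\t' -k3,3gr -k4,4g)
echo "$best_hits"
rm "$demo"
```

Protein `G1-p1` hits two markers; only its higher-scoring hit (`TIGR00064`) survives, so each protein is assigned to a single marker.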




## Prepare file lists

```bash
mkdir FileLists
cp /export/lv4/user/spang_team/Collaborations/tRNAs/Phylogeny/Overlapping_Set/v1/FileLists/52_markers_list_edited* FileLists/

#manually remove TIGR00335 and TIGR00483 from these lists! These markers have too many duplicates

#add GCF numbers from excel sheet into a file
nano FileLists/GenomeList

#store the path to the list of markers in a variable
undin_markers="/export/lv4/user/spang_team/Collaborations/tRNAs/Phylogeny/Overlapping_Set/v2/FileLists/52_markers_list_edited"
```




## Extract COGs of interest from previous search

```bash
#extract cog data for genomes of interest
grep -F -f FileLists/GenomeList Hmmersearch/All_NCBI_COGs_hmm.txt  > Hmmersearch/cogs_Relevant_taxa_subset.txt

#control that we got the data for all genomes --> 23
awk -F'\t' -v OFS='\t' '{split($1,a,"-")}{print a[1]}' Hmmersearch/cogs_Relevant_taxa_subset.txt | sort | uniq | wc -l

#duplicate column1 (for cosmetics and easier searching later)
awk -F'\t' -v OFS='\t' '{print $0, $1}'  Hmmersearch/cogs_Relevant_taxa_subset.txt > Hmmersearch/temp

#separate the elife marker genes into indiv. files
mkdir -p FileLists/single_markers

for sample in `cat  $undin_markers`; do grep "$sample" Hmmersearch/temp > FileLists/single_markers/${sample}.txt; done

#check that the nr of markers is good --> 43
ls FileLists/single_markers/*txt | wc -l

#shorten the first column to the genome ID (the full protein ID stays in column 5) and sort
mkdir FileLists/split

for sample in `cat  $undin_markers`; do awk -F'\t' -v OFS='\t' '{split($1,a,"-"); print a[1], $2, $3, $4, $5, $6, $7, $8 }' FileLists/single_markers/$sample* | LC_ALL=C sort > FileLists/split/$sample ; done

#check counts (here make sure that the max more or less makes sense)
wc -l FileLists/split/*
```
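
Instead of scanning the `wc -l` output by eye, the markers whose counts deviate from the expected number of genomes can be listed directly. This is a sketch under assumptions: `flag_counts` is a hypothetical helper, and the toy directory stands in for `FileLists/split`.

```bash
# Sketch: list files whose line count differs from the expected number of genomes.
# flag_counts is a hypothetical helper; arguments: directory, expected count.
flag_counts() {
    local dir=$1 expected=$2 f n
    for f in "$dir"/*; do
        n=$(awk 'END {print NR}' "$f")
        if [ "$n" -ne "$expected" ]; then
            printf '%s\t%d\n' "$(basename "$f")" "$n"
        fi
    done
}

# Toy demonstration: three "markers", expected count 3
demo=$(mktemp -d)
printf 'a\nb\nc\n'    > "$demo/markerA"
printf 'a\nb\nc\nd\n' > "$demo/markerB"
printf 'a\nb\n'       > "$demo/markerC"
flagged=$(flag_counts "$demo" 3)
echo "$flagged"
rm -r "$demo"
```

For this workflow the call would be `flag_counts FileLists/split 23`.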

From 23 genomes, this is how many sequences we have per marker (counts above 23 mean that at least one genome has more than one hit for that marker):


   23 FileLists/split/gtdb_PF00466.15
   23 FileLists/split/gtdb_PF00687.16
   23 FileLists/split/gtdb_PF00827.12
   23 FileLists/split/gtdb_PF00900.15
   23 FileLists/split/gtdb_PF01000.21
   23 FileLists/split/gtdb_PF01015.13
   23 FileLists/split/gtdb_PF01090.14
   23 FileLists/split/gtdb_PF01157.13
   23 FileLists/split/gtdb_PF01200.13
   23 FileLists/split/gtdb_PF01655.13
   23 FileLists/split/gtdb_PF01798.13
   23 FileLists/split/gtdb_PF07541.7
   23 FileLists/split/gtdb_PF13685.1
   23 FileLists/split/OG525.
   23 FileLists/split/PF00410.14
   23 FileLists/split/PF00673
   29 FileLists/split/TIGR00037
   32 FileLists/split/TIGR00064 **
   23 FileLists/split/TIGR00111
   23 FileLists/split/TIGR00279
   23 FileLists/split/TIGR00291
   30 FileLists/split/TIGR00373 **
   23 FileLists/split/TIGR00405
   23 FileLists/split/TIGR00448
   23 FileLists/split/TIGR00491
   23 FileLists/split/TIGR00501
   23 FileLists/split/TIGR00967
   23 FileLists/split/TIGR00982
   23 FileLists/split/TIGR01008
   23 FileLists/split/TIGR01012
   23 FileLists/split/TIGR01020
   23 FileLists/split/TIGR01028
   23 FileLists/split/TIGR01171
   23 FileLists/split/TIGR01425
   23 FileLists/split/TIGR02389
   22 FileLists/split/TIGR02390
   23 FileLists/split/TIGR03626
   23 FileLists/split/TIGR03628
   23 FileLists/split/TIGR03629
   23 FileLists/split/TIGR03670
   23 FileLists/split/TIGR03673
   23 FileLists/split/TIGR03680
   25 FileLists/split/TIGR03722



## Remove contaminations

We already identified the duplicated sequences to remove in the previous workflow (with the old taxon set), and we reuse that list here.


```bash
#get the list of proteins we removed in v1
cp ../v1/FileLists/proteins_to_remove FileLists/

mkdir FileLists/split_cleaned

#remove proteins from our lists
for sample in `cat  $undin_markers`; do grep -F -v -f FileLists/proteins_to_remove FileLists/split/$sample > FileLists/split_cleaned/$sample ; done

#control counts
wc -l FileLists/split_cleaned/*
```



   23 FileLists/split_cleaned/gtdb_PF00466.15
   23 FileLists/split_cleaned/gtdb_PF00687.16
   23 FileLists/split_cleaned/gtdb_PF00827.12
   23 FileLists/split_cleaned/gtdb_PF00900.15
   23 FileLists/split_cleaned/gtdb_PF01000.21
   23 FileLists/split_cleaned/gtdb_PF01015.13
   23 FileLists/split_cleaned/gtdb_PF01090.14
   23 FileLists/split_cleaned/gtdb_PF01157.13
   23 FileLists/split_cleaned/gtdb_PF01200.13
   23 FileLists/split_cleaned/gtdb_PF01655.13
   23 FileLists/split_cleaned/gtdb_PF01798.13
   23 FileLists/split_cleaned/gtdb_PF07541.7
   23 FileLists/split_cleaned/gtdb_PF13685.1
   23 FileLists/split_cleaned/OG525.
   23 FileLists/split_cleaned/PF00410.14
   23 FileLists/split_cleaned/PF00673
   23 FileLists/split_cleaned/TIGR00037
   23 FileLists/split_cleaned/TIGR00064
   23 FileLists/split_cleaned/TIGR00111
   23 FileLists/split_cleaned/TIGR00279
   23 FileLists/split_cleaned/TIGR00291
   23 FileLists/split_cleaned/TIGR00373
   23 FileLists/split_cleaned/TIGR00405
   23 FileLists/split_cleaned/TIGR00448
   23 FileLists/split_cleaned/TIGR00491
   23 FileLists/split_cleaned/TIGR00501
   23 FileLists/split_cleaned/TIGR00967
   23 FileLists/split_cleaned/TIGR00982
   23 FileLists/split_cleaned/TIGR01008
   23 FileLists/split_cleaned/TIGR01012
   23 FileLists/split_cleaned/TIGR01020
   23 FileLists/split_cleaned/TIGR01028
   23 FileLists/split_cleaned/TIGR01171
   23 FileLists/split_cleaned/TIGR01425
   23 FileLists/split_cleaned/TIGR02389
   22 FileLists/split_cleaned/TIGR02390
   23 FileLists/split_cleaned/TIGR03626
   23 FileLists/split_cleaned/TIGR03628
   23 FileLists/split_cleaned/TIGR03629
   23 FileLists/split_cleaned/TIGR03670
   23 FileLists/split_cleaned/TIGR03673
   23 FileLists/split_cleaned/TIGR03680
   23 FileLists/split_cleaned/TIGR03722



## Extract marker proteins

Note: these markers can include multi-copy proteins, which we need to remove at a later point.

```bash
#get list of proteins to extract from faa file
mkdir FileLists/protein_list_non_dedup

for sample in `cat  $undin_markers`; do awk -F'\t' -v OFS='\t' '{print $5 }' FileLists/split_cleaned/$sample* | LC_ALL=C sort > FileLists/protein_list_non_dedup/$sample ; done

#extract faa sequences
mkdir -p Marker_Genes/non_dedup

for sample in `cat  $undin_markers`; do perl ~/../spang_team/Scripts/Others/screen_list_new.pl FileLists/protein_list_non_dedup/$sample $faa_path  keep > Marker_Genes/non_dedup/${sample}.faa; done

#control that all the names you expect are there --> 23
grep ">" Marker_Genes/non_dedup/*faa | cut -f2 -d ">" | cut -f1 -d "-" | sort | uniq > Names.txt
wc -l Names.txt

#count proteins per sample
grep -c ">" Marker_Genes/non_dedup/*faa
```

Count number of proteins/marker:

Marker_Genes/non_dedup/gtdb_PF00466.15.faa:23
Marker_Genes/non_dedup/gtdb_PF00687.16.faa:23
Marker_Genes/non_dedup/gtdb_PF00827.12.faa:23
Marker_Genes/non_dedup/gtdb_PF00900.15.faa:23
Marker_Genes/non_dedup/gtdb_PF01000.21.faa:23
Marker_Genes/non_dedup/gtdb_PF01015.13.faa:23
Marker_Genes/non_dedup/gtdb_PF01090.14.faa:23
Marker_Genes/non_dedup/gtdb_PF01157.13.faa:23
Marker_Genes/non_dedup/gtdb_PF01200.13.faa:23
Marker_Genes/non_dedup/gtdb_PF01655.13.faa:23
Marker_Genes/non_dedup/gtdb_PF01798.13.faa:23
Marker_Genes/non_dedup/gtdb_PF07541.7.faa:23
Marker_Genes/non_dedup/gtdb_PF13685.1.faa:23
Marker_Genes/non_dedup/OG525..faa:23
Marker_Genes/non_dedup/PF00410.14.faa:23
Marker_Genes/non_dedup/PF00673.faa:23
Marker_Genes/non_dedup/TIGR00037.faa:23
Marker_Genes/non_dedup/TIGR00064.faa:23
Marker_Genes/non_dedup/TIGR00111.faa:23
Marker_Genes/non_dedup/TIGR00279.faa:23
Marker_Genes/non_dedup/TIGR00291.faa:23
Marker_Genes/non_dedup/TIGR00373.faa:23
Marker_Genes/non_dedup/TIGR00405.faa:23
Marker_Genes/non_dedup/TIGR00448.faa:23
Marker_Genes/non_dedup/TIGR00491.faa:23
Marker_Genes/non_dedup/TIGR00501.faa:23
Marker_Genes/non_dedup/TIGR00967.faa:23
Marker_Genes/non_dedup/TIGR00982.faa:23
Marker_Genes/non_dedup/TIGR01008.faa:23
Marker_Genes/non_dedup/TIGR01012.faa:23
Marker_Genes/non_dedup/TIGR01020.faa:23
Marker_Genes/non_dedup/TIGR01028.faa:23
Marker_Genes/non_dedup/TIGR01171.faa:23
Marker_Genes/non_dedup/TIGR01425.faa:23
Marker_Genes/non_dedup/TIGR02389.faa:23
Marker_Genes/non_dedup/TIGR02390.faa:22
Marker_Genes/non_dedup/TIGR03626.faa:23
Marker_Genes/non_dedup/TIGR03628.faa:23
Marker_Genes/non_dedup/TIGR03629.faa:23
Marker_Genes/non_dedup/TIGR03670.faa:23
Marker_Genes/non_dedup/TIGR03673.faa:23
Marker_Genes/non_dedup/TIGR03680.faa:23
Marker_Genes/non_dedup/TIGR03722.faa:23



## Calculate nr of sequences to remove

```bash 
cd Marker_Genes/non_dedup

#count number genomes with duplicates
for sample in *faa; do grep ">" ${sample}  | cut -f1 -d "-" | sort | uniq -d | cat <(echo $sample) <(wc -l) | pr -T -2; done > temp1

#count total number of duplicates (the sum stays empty for markers without duplicates)
for sample in *faa; do grep ">" ${sample}  | cut -f1 -d "-" | sort | uniq -d -c | awk -v var1=$sample '{sum += $1 - 1} END {print var1, sum}' ; done > temp2

#combine 
awk 'FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' temp2 temp1 | awk -v OFS="\t" '{print $1, $2, $4}' | cat <(echo -e 'MarkerID\tNr_dup_genomes\tNr_dup_genes') - 

rm temp*

cd ../..
```
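
The two counts can also be obtained in a single pass per file, which avoids the temporary files. This is a sketch of an alternative, not the commands used for the results here: `count_dups` is a hypothetical helper and the toy headers are invented.

```bash
# Sketch: per marker file, count genomes with duplicates and the number of surplus copies.
# count_dups is a hypothetical helper; output: file name, Nr_dup_genomes, Nr_dup_genes.
count_dups() {
    grep ">" "$1" | cut -f1 -d "-" | sort | uniq -c |
        awk -v m="$(basename "$1")" -v OFS='\t' \
            '$1 > 1 {g++; d += $1 - 1} END {print m, g + 0, d + 0}'
}

# Toy demonstration: G1 has 3 copies, G2 has 1, G3 has 2
demo=$(mktemp -d)
printf '>G1-a\nMK\n>G1-b\nMK\n>G1-c\nMK\n>G2-x\nMK\n>G3-y\nMK\n>G3-z\nMK\n' > "$demo/toy.faa"
dup_summary=$(count_dups "$demo/toy.faa")
echo "$dup_summary"
rm -r "$demo"
```

Two genomes carry duplicates and three sequences are surplus, so the toy file yields `2` and `3`. The `+ 0` makes markers without duplicates print `0` instead of an empty field.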



MarkerID        Nr_dup_genomes  Nr_dup_genes
gtdb_PF00466.15.faa     0
gtdb_PF00687.16.faa     0
gtdb_PF00827.12.faa     0
gtdb_PF00900.15.faa     0
gtdb_PF01000.21.faa     0
gtdb_PF01015.13.faa     0
gtdb_PF01090.14.faa     0
gtdb_PF01157.13.faa     0
gtdb_PF01200.13.faa     0
gtdb_PF01655.13.faa     0
gtdb_PF01798.13.faa     0
gtdb_PF07541.7.faa      0
gtdb_PF13685.1.faa      0
OG525..faa      0
PF00410.14.faa  0
PF00673.faa     0
TIGR00037.faa   0
TIGR00064.faa   0
TIGR00111.faa   0
TIGR00279.faa   0
TIGR00291.faa   0
TIGR00373.faa   0
TIGR00405.faa   0
TIGR00448.faa   0
TIGR00491.faa   0
TIGR00501.faa   0
TIGR00967.faa   0
TIGR00982.faa   0
TIGR01008.faa   0
TIGR01012.faa   0
TIGR01020.faa   0
TIGR01028.faa   0
TIGR01171.faa   0
TIGR01425.faa   0
TIGR02389.faa   0
TIGR02390.faa   0
TIGR03626.faa   0
TIGR03628.faa   0
TIGR03629.faa   0
TIGR03670.faa   0
TIGR03673.faa   0
TIGR03680.faa   0
TIGR03722.faa   0





### Calculate avg protein length and protein nr

```bash
cd Marker_Genes/non_dedup/

for sample in `cat $undin_markers`; do perl ~/../spang_team/Scripts/Others/length+GC.pl ${sample}* | awk -F'\t' -v var1=$sample '{sum+=$3 }END { print var1 , sum/NR }' ; done

#also generate a list with the sequence length of everything
for sample in `cat $undin_markers`; do perl ~/../spang_team/Scripts/Others/length+GC.pl ${sample}* | awk -F'\t' -v var1=$sample '{ print var1 , $1, $3 }' ; done > Markers_lengths.txt

grep -c ">" *faa

cd ../..
```

Avg protein length:

OG525. 590.261
PF00410.14 130
gtdb_PF00466.15 339.304
PF00673 183.826
gtdb_PF00687.16 215.652
gtdb_PF00827.12 194
gtdb_PF00900.15 244.435
gtdb_PF01000.21 251.826
gtdb_PF01015.13 203.261
gtdb_PF01090.14 149.13
gtdb_PF01157.13 97.3478
gtdb_PF01200.13 70.913
gtdb_PF01655.13 129.478
gtdb_PF01798.13 422.565
gtdb_PF07541.7 273.217
gtdb_PF13685.1 344.783
TIGR00037 136.261
TIGR00064 326.217
TIGR00111 355.304
TIGR00279 180.087
TIGR00291 235.739
TIGR00373 184.043
TIGR00405 151.13
TIGR00448 187.87
TIGR00491 793.565
TIGR00501 295.304
TIGR00967 466.217
TIGR00982 147
TIGR01008 209.174
TIGR01012 203.826
TIGR01020 233.957
TIGR01028 211.435
TIGR01171 239.13
TIGR01425 446.87
TIGR02389 393.348
TIGR02390 904.773
TIGR03626 350.609
TIGR03628 135.913
TIGR03629 148.478
TIGR03670 1053.04
TIGR03673 140.087
TIGR03680 410.217
TIGR03722 353.391



### Clean the header so that we can concatenate later


```bash
cd Marker_Genes/non_dedup/

#shorten header to be able to concatenate later
mkdir renamed

for sample in `cat $undin_markers`
do
cut -f1 -d "-" $sample* > renamed/${sample}.faa
done

#control that all the names you expect are there --> 23
grep ">" renamed/*faa | cut -f2 -d ">" | sort | uniq > Names.txt
wc -l Names.txt

cd ../..
```
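
Note that `cut -f1 -d "-"` acts on every line, not only on headers. That is harmless for unaligned protein sequences, but it would truncate aligned sequences at the first gap character. A header-only variant could look like the sketch below (`trim_headers` is a hypothetical helper; the toy record is invented).

```bash
# Sketch: shorten only the header lines to the genome ID, leave sequence lines untouched.
# trim_headers is a hypothetical helper.
trim_headers() {
    awk '/^>/ {sub(/-.*/, "")} {print}' "$1"
}

# Toy demonstration: the sequence line contains a gap character
demo=$(mktemp)
printf '>G1-prot1 some description\nMK-V\n' > "$demo"
trimmed=$(trim_headers "$demo")
echo "$trimmed"
rm "$demo"
```

The header becomes `>G1`, while the sequence line `MK-V` is kept intact.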





### Make a count table


This table lists, for each genome (before removing duplicates), how many and which markers were found. To run this for the deduplicated marker genes instead, change the folder in the first step.

```bash
#for each marker, list what genomes have that marker
cd Marker_Genes/non_dedup

#assumes that the genome and protein ID is separated by a `-`
#gives a two column table (Marker, Genome)
grep ">" *faa | sed 's/\.faa:>/\t/g' | awk 'BEGIN{FS=OFS="\t"}{split($2,a,"-")}{print $1,a[1]}' > MarkerList.txt

#make a count table in python (heredoc, so the whole block runs as one script)
python <<'EOF'
#load relevant libs
import pandas as pd

#read in data, add header and view data
df = pd.read_csv('MarkerList.txt', sep="\t", header=None)
df.columns=["marker", "genome"]
print(df.head())

#group the data by the markers and count the occurrences
count_table=df.groupby(["marker","genome"]).size().reset_index()
count_table.columns=["marker", "genome", "indiv_count"]
print(count_table.head())

#for each genome give a total count
total_counts = count_table.groupby(by=["genome"])["indiv_count"].sum().reset_index()
total_counts.columns = ['genome', 'total_count']
print(total_counts.head())

#combine with prev dataframe
count_table_2 = pd.merge(count_table, total_counts, on = "genome", how = "left")
print(count_table_2.head())

#convert from long to wide
counts_wide=count_table_2.pivot_table(values="indiv_count", index=["genome",'total_count'], columns= ["marker"], fill_value=0, margins=True)
print(counts_wide.head())

#write the table to file
counts_wide.to_csv('counts.txt',sep='\t')
EOF

cd ../..

#check file
#less -S Marker_Genes/non_dedup/counts.txt
```




## Prepare alignment

Note: adjust the parallel option (`-j`) to the number of free CPUs, and take care not to use more than 30% of the available CPUs.

**Note: depending on the number of taxa and/or the type of analysis, use either mafft or mafft-linsi.**
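
As a worked example of the 30% rule: the total load is the number of parallel jobs times the threads per job. The sketch below assumes that this product should stay within 30% of the CPU count; `max_jobs` is a hypothetical helper, not part of GNU parallel.

```bash
# Sketch: largest -j value so that jobs * threads stays within 30% of the CPUs.
# max_jobs is a hypothetical helper; arguments: number of CPUs, threads per job.
max_jobs() {
    local ncpu=$1 threads=$2
    local jobs=$(( ncpu * 30 / 100 / threads ))
    if [ "$jobs" -lt 1 ]; then
        jobs=1   # always allow at least one job
    fi
    echo "$jobs"
}

jobs_large=$(max_jobs 80 4)   # 80 CPUs, 4 threads per job
jobs_small=$(max_jobs 8 4)    # 8 CPUs, 4 threads per job
echo "80 CPUs: -j$jobs_large, 8 CPUs: -j$jobs_small"
```

On a machine with GNU coreutils the CPU count can be taken from `nproc`, for example `max_jobs "$(nproc)" 4`.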


### Align and trim

```bash
#6a. align with mafft-linsi
mkdir Alignment
mkdir Alignment/mafft

parallel -j6 'i={}; mafft-linsi  --reorder --thread 4 Marker_Genes/non_dedup/renamed/${i}* > Alignment/mafft/${i}.aln' ::: `cat $undin_markers`


#6b. Trim using BMGE
mkdir Alignment/BMGE
mkdir Alignment/BMGE/h0.55

parallel -j5 'i={}; nice -n 10 java -jar /opt/biolinux/BMGE-1.12/BMGE.jar -i Alignment/mafft/$i* -t AA -m BLOSUM30 -h 0.55 -of Alignment/BMGE/h0.55/$i ' :::  `cat $undin_markers`
```







### Calculate avg aln length

```bash
cd Alignment/BMGE/h0.55/

for sample in `cat $undin_markers`; do perl ~/../spang_team/Scripts/Others/length+GC.pl ${sample}* | awk -F'\t' -v var1=$sample '{sum+=$3 }END { print var1 , sum/NR }' ; done

cd ../../..
```


Avg aln length:

OG525. 589
PF00410.14 130
gtdb_PF00466.15 333
PF00673 182
gtdb_PF00687.16 216
gtdb_PF00827.12 194
gtdb_PF00900.15 243
gtdb_PF01000.21 257
gtdb_PF01015.13 197
gtdb_PF01090.14 150
gtdb_PF01157.13 97
gtdb_PF01200.13 70
gtdb_PF01655.13 122
gtdb_PF01798.13 394
gtdb_PF07541.7 273
gtdb_PF13685.1 344
TIGR00037 136
TIGR00064 298
TIGR00111 356
TIGR00279 180
TIGR00291 236
TIGR00373 166
TIGR00405 151
TIGR00448 186
TIGR00491 597
TIGR00501 293
TIGR00967 452
TIGR00982 147
TIGR01008 203
TIGR01012 199
TIGR01020 232
TIGR01028 215
TIGR01171 239
TIGR01425 442
TIGR02389 390
TIGR02390 903
TIGR03626 340
TIGR03628 132
TIGR03629 148
TIGR03670 1117
TIGR03673 140
TIGR03680 409
TIGR03722 324




### Concatenate sequences


```bash
#6c. concatenate 
mkdir Alignment/concatenated

/export/lv1/user/spang_team/Scripts/catfasta2phyml/catfasta2phyml.pl -f -c Alignment/BMGE/h0.55/* > Alignment/concatenated/UndinMarkers_Thermococcus_v2.faa

#control that we have the nr of genomes we expect --> 23
grep -c ">" Alignment/concatenated/UndinMarkers_Thermococcus_v2.faa
```

Length of the alignment and position of the individual markers (12,422 positions in total):

Alignment/BMGE/h0.55/gtdb_PF00466.15 = 1-333
Alignment/BMGE/h0.55/gtdb_PF00687.16 = 334-549
Alignment/BMGE/h0.55/gtdb_PF00827.12 = 550-743
Alignment/BMGE/h0.55/gtdb_PF00900.15 = 744-986
Alignment/BMGE/h0.55/gtdb_PF01000.21 = 987-1243
Alignment/BMGE/h0.55/gtdb_PF01015.13 = 1244-1440
Alignment/BMGE/h0.55/gtdb_PF01090.14 = 1441-1590
Alignment/BMGE/h0.55/gtdb_PF01157.13 = 1591-1687
Alignment/BMGE/h0.55/gtdb_PF01200.13 = 1688-1757
Alignment/BMGE/h0.55/gtdb_PF01655.13 = 1758-1879
Alignment/BMGE/h0.55/gtdb_PF01798.13 = 1880-2273
Alignment/BMGE/h0.55/gtdb_PF07541.7 = 2274-2546
Alignment/BMGE/h0.55/gtdb_PF13685.1 = 2547-2890
Alignment/BMGE/h0.55/OG525. = 2891-3479
Alignment/BMGE/h0.55/PF00410.14 = 3480-3609
Alignment/BMGE/h0.55/PF00673 = 3610-3791
Alignment/BMGE/h0.55/TIGR00037 = 3792-3927
Alignment/BMGE/h0.55/TIGR00064 = 3928-4225
Alignment/BMGE/h0.55/TIGR00111 = 4226-4581
Alignment/BMGE/h0.55/TIGR00279 = 4582-4761
Alignment/BMGE/h0.55/TIGR00291 = 4762-4997
Alignment/BMGE/h0.55/TIGR00373 = 4998-5163
Alignment/BMGE/h0.55/TIGR00405 = 5164-5314
Alignment/BMGE/h0.55/TIGR00448 = 5315-5500
Alignment/BMGE/h0.55/TIGR00491 = 5501-6097
Alignment/BMGE/h0.55/TIGR00501 = 6098-6390
Alignment/BMGE/h0.55/TIGR00967 = 6391-6842
Alignment/BMGE/h0.55/TIGR00982 = 6843-6989
Alignment/BMGE/h0.55/TIGR01008 = 6990-7192
Alignment/BMGE/h0.55/TIGR01012 = 7193-7391
Alignment/BMGE/h0.55/TIGR01020 = 7392-7623
Alignment/BMGE/h0.55/TIGR01028 = 7624-7838
Alignment/BMGE/h0.55/TIGR01171 = 7839-8077
Alignment/BMGE/h0.55/TIGR01425 = 8078-8519
Alignment/BMGE/h0.55/TIGR02389 = 8520-8909
Alignment/BMGE/h0.55/TIGR02390 = 8910-9812
Alignment/BMGE/h0.55/TIGR03626 = 9813-10152
Alignment/BMGE/h0.55/TIGR03628 = 10153-10284
Alignment/BMGE/h0.55/TIGR03629 = 10285-10432
Alignment/BMGE/h0.55/TIGR03670 = 10433-11549
Alignment/BMGE/h0.55/TIGR03673 = 11550-11689
Alignment/BMGE/h0.55/TIGR03680 = 11690-12098
Alignment/BMGE/h0.55/TIGR03722 = 12099-12422
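
A quick consistency check on such a listing is that each marker starts exactly one position after the previous one ends. The sketch below assumes the `name = start-end` layout shown above; `check_partitions` is a hypothetical helper, and the demonstration uses the first three lines of the listing.

```bash
# Sketch: verify that partition ranges are contiguous and report the total length.
# check_partitions is a hypothetical helper; input lines look like "name = start-end".
check_partitions() {
    awk '{
        split($3, r, "-")
        if (r[1] != prev_end + 1) { print "gap or overlap before " $1; bad = 1 }
        prev_end = r[2]
    } END { print "total_length", prev_end; exit bad }' "$1"
}

demo=$(mktemp)
printf '%s\n' \
    'Alignment/BMGE/h0.55/gtdb_PF00466.15 = 1-333' \
    'Alignment/BMGE/h0.55/gtdb_PF00687.16 = 334-549' \
    'Alignment/BMGE/h0.55/gtdb_PF00827.12 = 550-743' > "$demo"
partition_report=$(check_partitions "$demo")
echo "$partition_report"
rm "$demo"
```

The function exits with a non-zero status if any range does not follow directly on the previous one.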




### Check for sequences with lots of gaps and remove (optional)


At the moment we only remove sequences that consist of 100% gaps (threshold `1.0`), which mainly serves as a sanity check that the removal step works. Adjust the threshold for your purposes if you want to be more stringent.

```bash
python ~/../spang_team/Scripts/Others/faa_drop.py Alignment/concatenated/UndinMarkers_Thermococcus_v2.faa Alignment/concatenated/UndinMarkers_Thermococcus_v2_no_gappy_seq.faa 1.0 > gap_removal_summary

#how many sequences were removed? --> 0
wc -l gap_removal_summary 
```
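
To choose a stricter threshold it helps to look at the gap fraction of each sequence first. The sketch below only reports these fractions and makes no assumption about how `faa_drop.py` works internally; `gap_fraction` is a hypothetical helper and the toy alignment is invented.

```bash
# Sketch: report the fraction of gap characters ("-") per sequence in an aligned FASTA file.
# gap_fraction is a hypothetical helper; output: sequence name, gap fraction.
gap_fraction() {
    awk '
        function report() {
            if (name != "") { printf "%s\t%.2f\n", name, (len ? gaps / len : 0) }
        }
        /^>/ { report(); name = substr($0, 2); gaps = 0; len = 0; next }
        { len += length($0); gaps += gsub(/-/, "-") }
        END { report() }
    ' "$1"
}

# Toy demonstration: G1 has 2 gaps in 8 positions, G2 consists only of gaps
demo=$(mktemp)
printf '>G1\nMK-V\n-LAG\n>G2\n----\n----\n' > "$demo"
gap_report=$(gap_fraction "$demo")
echo "$gap_report"
rm "$demo"
```

In the toy alignment only `G2` reaches a gap fraction of 1.00 and would be dropped at the current threshold.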



### Run iqtree

```bash
#lg c60 on laplace
mkdir -p Phylogeny/IQtree/v1_lg_c60
cp Alignment/concatenated/UndinMarkers_Thermococcus_v2_no_gappy_seq.faa Phylogeny/IQtree/v1_lg_c60
cd Phylogeny/IQtree/v1_lg_c60

cp ~/../spang_team/Scripts/Bash_scripts/iqtree_concat.sh .

#started on laplace, no87, 21482
module load iqtree/2.1.1
iqtree2 -s UndinMarkers_Thermococcus_v2_no_gappy_seq.faa  -m LG+C60+F+R  -T AUTO --threads-max 80 -bb 1000 -alrt 1000
```
