# A pandemic clonal lineage of the wheat blast fungus
# 5. Phylogenetic Analyses

Program                         | Location
------------------------------- | --------------------------------------
*bcftools v.1.11*               | (https://github.com/samtools/bcftools)
*PLINK v.1.9*                   | (https://www.cog-genomics.org/plink)
*tped2fasta*                    | (https://github.com/smlatorreo/misc_tools)
*RAxML-NG v.1.0.3*              | (https://github.com/amkozlov/raxml-ng)
*ClonalFrameML v.1.12*          | (https://github.com/xavierdidelot/clonalframeml)
*clean_homoplasy_from_fasta.py* | [This repository](/scripts/05_Phylogeny/clean_homoplasy_from_fasta.py)
*BactDating*                    | (https://github.com/xavierdidelot/BactDating)
*R*                             | (https://cran.r-project.org/)
*mask_positions.py*             | [This repository](/scripts/05_Phylogeny/mask_positions.py)
*BEAST2*                        | (http://www.beast2.org/)

To carry out the phylogenetic analyses, we used only non-recombining genetic groups (clonal lineages) (see [4. Recombination analyses](/04_Recombination_Analyses.md)). The final dataset included all isolates from the B71 and PY0925 clonal lineages (the latter was used as an outgroup). We kept only positions with no missing data (full information).

```bash
bcftools view -a -S B71clust_PY0925clust.list wheat-blast.snps.filtered.vcf.gz |
bcftools view -m2 -M2 -g ^miss |
bgzip > B71clust_PY0925clust.snps.filtered.fullinfo.vcf.gz
```
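
As a rough illustration of what the two filters retain, the following Python sketch applies the same per-record logic to plain VCF lines: exactly one ALT allele (`-m2 -M2`) and no missing genotype in any sample (`-g ^miss`). The function name and the toy records are hypothetical, and the sketch does not reproduce the sample subsetting or the allele trimming done by `-a`; it is not a substitute for *bcftools*.

```python
def is_full_info_biallelic(vcf_line: str) -> bool:
    """Return True for a biallelic record in which every genotype is called."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt_alleles = fields[4].split(",")
    # -m2 -M2: exactly two alleles in total (REF plus a single ALT)
    if len(alt_alleles) != 1 or alt_alleles[0] == ".":
        return False
    # -g ^miss: reject the site if any sample genotype has a missing allele
    for sample in fields[9:]:
        genotype = sample.split(":")[0]
        if "." in genotype.replace("|", "/").split("/"):
            return False
    return True


# Toy records (hypothetical): one valid site, one with a missing call,
# and one multiallelic site
records = [
    "chr1\t10\t.\tA\tG\t50\tPASS\t.\tGT\t0\t1\t1",
    "chr1\t20\t.\tC\tT\t50\tPASS\t.\tGT\t0\t.\t1",
    "chr1\t30\t.\tG\tA,T\t50\tPASS\t.\tGT\t0\t1\t2",
]
kept = [r.split("\t")[1] for r in records if is_full_info_biallelic(r)]
print(kept)
```

Only the first toy record passes both conditions.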

## Maximum-Likelihood (ML) phylogeny
We converted the VCF file into a pseudo-fasta format to have whole-genome concatenated SNPs per isolate as a suitable input for the phylogenetic analyses.

```bash
plink --allow-extra-chr --vcf B71clust_PY0925clust.snps.filtered.fullinfo.vcf.gz \
--recode transpose --out B71clust_PY0925clust.snps.filtered.fullinfo

tped2fasta B71clust_PY0925clust.snps.filtered.fullinfo > B71clust_PY0925clust.snps.filtered.fullinfo.fasta
```

Files can be found at: [B71clust_PY0925clust.snps.filtered.fullinfo.tped](/data/05_Phylogeny/B71clust_PY0925clust.snps.filtered.fullinfo.tped) ; [B71clust_PY0925clust.snps.filtered.fullinfo.tfam](/data/05_Phylogeny/B71clust_PY0925clust.snps.filtered.fullinfo.tfam) ; [B71clust_PY0925clust.snps.filtered.fullinfo.fasta](/data/05_Phylogeny/B71clust_PY0925clust.snps.filtered.fullinfo.fasta).  
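
The conversion itself is conceptually simple. The sketch below is a minimal Python illustration, not the actual *tped2fasta* implementation: it assumes a standard transposed PLINK layout (four leading columns in the `.tped` file followed by two alleles per sample, and the sample ID in the second column of the `.tfam` file) and, because the isolates are haploid, keeps the first allele of each pair. The function name and the toy input are hypothetical.

```python
def tped_to_fasta(tped_lines, tfam_lines):
    """Concatenate one allele per SNP and per sample into a FASTA string."""
    # Assumption: the second .tfam column holds the sample (isolate) name
    names = [line.split()[1] for line in tfam_lines if line.strip()]
    seqs = {name: [] for name in names}
    for line in tped_lines:
        if not line.strip():
            continue
        # Columns 1-4 are chromosome, SNP ID, genetic distance and position
        alleles = line.split()[4:]
        for i, name in enumerate(names):
            # Haploid data: both alleles of a pair are identical, keep the first
            seqs[name].append(alleles[2 * i])
    return "".join(f">{name}\n{''.join(seqs[name])}\n" for name in names)


# Toy input (hypothetical): two isolates, two SNPs
tfam = ["B71 B71 0 0 0 -9", "PY0925 PY0925 0 0 0 -9"]
tped = ["chr1 snp1 0 10 A A G G", "chr1 snp2 0 25 T T C C"]
fasta = tped_to_fasta(tped, tfam)
print(fasta)
```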

Then, we generated an ML phylogeny using RAxML-NG with a GTR+G substitution model and 1,000 bootstrap replicates.
```bash
raxml-ng --all --msa B71clust_PY0925clust.snps.filtered.fullinfo.fasta --msa-format FASTA \
--data-type DNA --model GTR+G --bs-trees 1000
```
The best tree (with bootstrap support values) can be found at: [B71clust_PY0925clust.snps.filtered.fullinfo.raxml.support](/data/05_Phylogeny/B71clust_PY0925clust.snps.filtered.fullinfo.raxml.support)

## Detection of putative recombination events
To detect putative recombination events and take them into account in the phylogenetic reconstruction, we used *ClonalFrameML*.

```bash
ClonalFrameML B71clust_PY0925clust.snps.filtered.fullinfo.fasta.raxml.bestTree B71clust_PY0925clust.snps.filtered.fullinfo.fasta
```
Important summary statistics produced by *ClonalFrameML* can be found at [this table](/data/05_Phylogeny/B71_and_PY0925_clust.snps.filtered.fullinfo.em.txt).  
We used the output of *ClonalFrameML* as input for the dating analyses (see Phylogenetic dating).  

Furthermore, we tested the effect of removing the genomic regions with putative recombination events on the ML phylogenetic reconstruction. For this purpose, we used the *ClonalFrameML* output file *_prefix_.importation_status.txt* and the custom *Python* script [*clean_homoplasy_from_fasta.py*](/scripts/05_Phylogeny/clean_homoplasy_from_fasta.py) to remove all such regions from the original concatenated-SNP alignment file.
```bash
python clean_homoplasy_from_fasta.py B71clust_PY0925clust.snps.filtered.fullinfo.importation_status.txt \
B71clust_PY0925clust.snps.filtered.fullinfo.fasta > B71clust_PY0925clust.snps.filtered.fullinfo.clean.fasta \
2> B71clust_PY0925clust.snps.filtered.fullinfo.homoplasy.fasta
```
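
For readers who want the general idea without opening the script, the following Python sketch shows one way such a clean-up could work. It is not the code of *clean_homoplasy_from_fasta.py*: it assumes that the importation status file is a tab-delimited table with a header and `Node`, `Beg`, `End` columns holding 1-based inclusive alignment coordinates, and it simply drops every alignment column that falls inside any reported interval. Function names and toy data are hypothetical.

```python
def parse_intervals(status_lines):
    """Collect (begin, end) pairs from the table, skipping the header line."""
    intervals = []
    for line in status_lines[1:]:
        if line.strip():
            _node, beg, end = line.split()[:3]
            intervals.append((int(beg), int(end)))
    return intervals


def drop_columns(seqs, intervals):
    """Remove every alignment column covered by at least one interval."""
    length = len(next(iter(seqs.values())))
    flagged = set()
    for beg, end in intervals:
        flagged.update(range(beg, end + 1))  # 1-based, inclusive (assumption)
    keep = [i for i in range(length) if (i + 1) not in flagged]
    return {name: "".join(seq[i] for i in keep) for name, seq in seqs.items()}


# Toy alignment and importation table (hypothetical)
seqs = {"B71": "ACGTAC", "PY0925": "ATGCAT"}
status = ["Node\tBeg\tEnd", "NODE_5\t2\t3", "B71\t6\t6"]
clean = drop_columns(seqs, parse_intervals(status))
print(clean)
```

In the toy example, columns 2, 3 and 6 are removed from both sequences, so the cleaned alignment keeps columns 1, 4 and 5.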

Finally, we used the filtered fasta alignment file to compute a new ML phylogeny with RAxML-NG.
```bash
raxml-ng --all --msa B71clust_PY0925clust.snps.filtered.fullinfo.clean.fasta --data-type DNA \
--model GTR+G --bs-trees 1000
```

## Phylogenetic dating
### Temporal Signal

We used the recombination-free tree generated by *ClonalFrameML* as input ([B71clust_PY0925clust.snps.filtered.fullinfo.labelled_tree.newick](/data/05_Phylogeny/B71_and_PY0925_clust.snps.filtered.fullinfo.labelled_tree.newick)). To evaluate the presence of a temporal signal in the dataset, we used the isolate collection dates ([B71_and_PY0925_clust.dates](/data/05_Phylogeny/B71_and_PY0925_clust.dates)).
```{r}
# R
library(ape)
library(scales)

# Load ML tree
t <- read.tree('B71clust_PY0925clust.snps.filtered.fullinfo.labelled_tree.newick')

# Compute pairwise cophenetic / patristic distances and select distances to PY0925
distances <- cophenetic(t)
dist_to_PY0925 <- distances[colnames(distances) == 'PY0925', ]

# Just keep distances of the B71 lineage (remove those from isolates: '053i','PY0925','117','37','12.1.037')
dist_to_PY0925 <- dist_to_PY0925[! names(dist_to_PY0925) %in% c('053i','PY0925','117','37','12.1.037')]

# Load the collection dates and match them with the distances
dt <- read.table('B71_and_PY0925_clust.dates', header = FALSE)
dts <- c()
for(n in names(dist_to_PY0925)){dts <- c(dts, dt[dt[,1] == n, 2])}
m <- data.frame(Coll_Year = dts, Patr_dist_to_PY0925 = dist_to_PY0925)

plot(m)
legend('topleft', paste("Pearson\'s r =", round(cor(m$Coll_Year, m$Patr_dist_to_PY0925), 2)), bty = 'n')
abline(lm(Patr_dist_to_PY0925 ~ Coll_Year, data = m), lty = 2)
cor.test(m$Coll_Year, m$Patr_dist_to_PY0925)
```
![Distances vs Dates](/data/05_Phylogeny/Dist_vs_Dates.png)

We tested the robustness of the correlation between root-to-tip distance and collection date by resampling the isolates with replacement and recalculating the correlation coefficient 1,000 times. Additionally, we randomly permuted the collection dates among isolates and recalculated the correlation coefficient 1,000 times.
```{r}
set.seed(123)
resamplings <- c()
permutations <- c()
for(i in 1:1000){
	nm <- m[sample(nrow(m), replace = T), ]
	p <- cor(nm$Coll_Year, nm$Patr_dist_to_PY0925)
	resamplings <- c(resamplings, p)
	nmp <- cbind(sample(m$Coll_Year, replace = F), m$Patr_dist_to_PY0925)
	p <- cor(nmp[,1], nmp[,2])
	permutations <- c(permutations, p)
}
boxplot(cbind(resamplings, permutations), outline = FALSE)
```
![Resampling and Permutation](/data/05_Phylogeny/Resampling_Permutation.png)

### Phylogenetic Dating using BactDating
Finally, we used *BactDating* to generate a dated phylogeny. We used the function *loadCFML*, which permits direct use of the *ClonalFrameML* output as input.
```{r}
# R
library(BactDating)

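tree <- loadCFML('B71_and_PY0925_clust.snps.filtered.fullinfo')

# The dates vector needs one collection date per tip, in the order of tree$tip.label
# (dt is the collection-date table loaded in the Temporal Signal block)
dts <- dt[match(tree$tip.label, dt[, 1]), 2]
rooted <- initRoot(tree, dts, mtry = 1000)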

rslt = bactdate(rooted, dts, nbIts = 1000000, thin = 1000, updateRoot = F, showProgress = T)

```

### Phylogenetic Dating using BEAST 2
We used the output file provided by [*ClonalFrameML*](/data/05_Phylogeny/B71clust_PY0925clust.snps.filtered.fullinfo.importation_status_NODEs_removed.txt) to mask putative recombining SNPs, which were labelled as missing information ("?") using the custom Python script [*mask_positions.py*](/scripts/05_Phylogeny/mask_positions.py) on the genome-wide SNP alignment [fasta file](/data/05_Phylogeny/B71clust_PY0925clust.snps.filtered.fullinfo.fasta).

```bash
python mask_positions.py B71clust_PY0925clust.snps.filtered.fullinfo.fasta B71clust_PY0925clust.snps.filtered.fullinfo.importation_status_NODEs_removed.txt > B71_and_PY0925_clust.snps.filtered.fullinfo.recomb_masked.fasta
```
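
The masking step can be pictured with the short Python sketch below. It is an illustration under stated assumptions, not the code of *mask_positions.py*: it assumes a tab-delimited table with a header and `Node`, `Beg`, `End` columns (1-based, inclusive), treats rows whose `Node` matches a sequence name as isolate-specific events, and skips any other row (for example internal nodes). Function name and toy data are hypothetical.

```python
def mask_positions(seqs, status_lines, missing="?"):
    """Replace flagged positions with a missing-data symbol, per isolate."""
    masked = {name: list(seq) for name, seq in seqs.items()}
    for line in status_lines[1:]:  # skip the header line
        if not line.strip():
            continue
        node, beg, end = line.split()[:3]
        if node not in masked:
            # Row does not refer to a sequence in the alignment; ignore it
            continue
        for pos in range(int(beg), int(end) + 1):
            masked[node][pos - 1] = missing  # convert 1-based to 0-based
    return {name: "".join(chars) for name, chars in masked.items()}


# Toy alignment and importation table (hypothetical)
seqs = {"B71": "ACGTAC", "PY0925": "ATGCAT"}
status = ["Node\tBeg\tEnd", "B71\t2\t3", "NODE_7\t1\t6", "PY0925\t6\t6"]
masked = mask_positions(seqs, status)
print(masked)
```

Unlike removing whole columns, masking keeps the alignment length unchanged, so the remaining sites in the other isolates still contribute to the analysis.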

The resulting [masked fasta file](/data/05_Phylogeny/B71_and_PY0925_clust.snps.filtered.fullinfo.recomb_masked.fasta) was used as input to create the configuration file with *BEAUti* (*BEAST 2*) using the following parameters and options:

- Tips were calibrated with the [collection dates](/data/05_Phylogeny/B71_and_PY0925_clust.dates)
- HKY substitution model
- Strict clock model
- Uniform prior for the clock rate: [1E-10 to 1E-3] with a starting value of 1E-5
- Tree prior: Coalescent Extended Bayesian Skyline
- Monophyletic prior for the different clusters: Zambian isolates; Bangladeshi isolates; B71 cluster; PY0925 cluster
- Chain length: 20,000,000
- Log every: 1,000
- Accounting for invariant sites by manually including the tag `constantSiteWeights='9117544 9766162 9779548 9135832'` after the `<data>` block
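
The source does not state how the four invariant-site counts were obtained. As a hedged illustration only, the Python sketch below shows one plausible way to derive such a string: count the bases of a reference sequence at all positions that are not in the SNP alignment and report them in A, C, G, T order (the order is an assumption to check against the BEAST 2 documentation). The function name and toy data are hypothetical, and a real calculation would also need to exclude positions that were filtered out for other reasons, such as missing data.

```python
from collections import Counter


def constant_site_weights(reference, variable_positions):
    """Count invariant A, C, G and T bases outside the variable positions."""
    counts = Counter(
        base.upper()
        for pos, base in enumerate(reference, start=1)  # 1-based positions
        if pos not in variable_positions
    )
    # Ambiguous bases such as N are counted but not reported
    return " ".join(str(counts[base]) for base in "ACGT")


# Toy reference with SNPs at positions 2 and 5 (hypothetical)
toy_reference = "AACCGGTTAC"
toy_snps = {2, 5}
weights = constant_site_weights(toy_reference, toy_snps)
print(f"constantSiteWeights='{weights}'")
```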

The resulting [XML configuration file](/data/05_Phylogeny/B71_and_PY0925_clust.recomb_masked.BEAST2.xml) was submitted to the [CIPRES Science Gateway](https://www.phylo.org/) with the following command to compute a Bayesian tip-dated phylogenetic reconstruction:
```bash
beast -threads 3 -instances 3 -beagle_SSE -beagle_scaling dynamic infile.xml
```

After four independent chains were computed, we used *LogCombiner* and *TreeAnnotator* from *BEAST 2* to combine the chains and calculate a [Maximum Clade Credibility tree](/data/05_Phylogeny/B71_and_PY0925_clust.recomb_masked.COMBINED.MC.tree), respectively.  


To test the robustness of our evolutionary rate estimates to changes in the substitution and clock models, we repeated the analysis using GTR in combination with a strict clock model, and HKY in combination with a random local clock model. The BEAST2 XML configuration files can be found here: [GTR with strict clock](/data/05_Phylogeny/B71_and_PY0925_clust.recomb_masked.BEAST2.GTR_StrictClock.xml) and [HKY with random local clock](/data/05_Phylogeny/B71_and_PY0925_clust.recomb_masked.BEAST2.HYK_RandomLocalClock.xml).

---
[Main README](/README.md) | [Previous - 04. Recombination Analyses](/04_Recombination_Analyses.md) | [Next - 06. Mating Type Analyses](/06_Mating_Type.md)
