Published May 18, 2023 | Version v1
Other Open

Supplementary material for: PhyloCoalSimulations: A simulator for network multispecies coalescent models, including a new extension for the inheritance of gene flow

  • 1. University of Wisconsin-Madison
  • 2. University of Alaska Fairbanks

Description

We consider the evolution of phylogenetic gene trees along phylogenetic species networks, according to the network multispecies coalescent process, and introduce a new network coalescent model with correlated inheritance of gene flow. This model generalizes two traditional versions of the network coalescent: with independent or common inheritance. At each reticulation, multiple lineages of a given locus are inherited from parental populations chosen at random, either independently across lineages, or with positive correlation according to a Dirichlet process. This process may account for locus-specific probabilities of inheritance, for example.
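As an illustration only, the inheritance step at a single reticulation can be sketched with an urn scheme for the Dirichlet process. This is not the PhyloCoalSimulations implementation (which is in Julia). The function name `sample_parents`, the two-parent setting, and the parameterization of the Dirichlet process by a concentration alpha = (1 - rho) / rho are assumptions made for this sketch; they are chosen so that rho = 0 gives independent inheritance and rho = 1 gives common inheritance.

```python
import random


def sample_parents(n_lineages, gamma, rho, rng=None):
    """Assign each lineage at a reticulation to parent 0 or parent 1.

    gamma: probability that a fresh draw picks parent 0 (hypothetical name).
    rho:   inheritance correlation in [0, 1]; 0 = independent, 1 = common.
    Assumption: Dirichlet process with concentration alpha = (1 - rho) / rho.
    """
    if not 0.0 <= rho <= 1.0 or not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma and rho must lie in [0, 1]")
    rng = rng or random.Random()
    parents = []
    for k in range(n_lineages):
        # With k lineages already assigned, the next lineage copies one of
        # them with probability k / (alpha + k) = k*rho / ((1 - rho) + k*rho).
        p_copy = (k * rho) / ((1.0 - rho) + k * rho) if k > 0 else 0.0
        if rng.random() < p_copy:
            parents.append(rng.choice(parents))  # follow an earlier lineage
        else:
            parents.append(0 if rng.random() < gamma else 1)  # fresh draw
    return parents
```

Under this sketch, the second lineage copies the first with probability rho, so two lineages choose the same parent with probability rho + (1 - rho)(gamma^2 + (1 - gamma)^2).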

We implemented the simulation of gene trees under these network coalescent models in the Julia package PhyloCoalSimulations, which depends on PhyloNetworks and its powerful network manipulation tools. Input species phylogenies can be read in extended Newick format, with branch lengths either in numbers of generations or in coalescent units. Simulated gene trees can be written in Newick format, in a way that preserves information about their embedding within the species network. This embedding can be used for downstream purposes, such as simulating species-specific processes like rate variation across species, or for other scenarios as illustrated in this note. This package should be useful for simulation studies and simulation-based inference methods. The software is available open source with documentation and a tutorial at https://github.com/cecileane/PhyloCoalSimulations.jl.

Notes

Supplementary Material

`supplementarymaterial.pdf` contains an appendix and supplementary figures S1-S6.

Code to reproduce analyses

The code uses Julia and R. The files `Project.toml` and `Manifest.toml` record the Julia packages used and their exact versions. To reproduce the environment, activate this folder as the Julia project and run `instantiate` in package mode within Julia.

Fig. 1: node mapping

  • `figures.jl`: Julia code to simulate a gene tree with degree-2 nodes for mapping the gene tree into the species network, and to create the first 2 panels of Fig. 1. Output: files `fig_nodemapping*.pdf`.

Fig. 2: validation of quartet concordance factors

  • `validation_qCF.jl`: Julia code to reproduce the simulations in Fig. 2. Running this code creates 3 output files: `qCF_4taxa.csv` for the left network a), and `qCF_case_{1,2}.csv` for the right network b) on 6 taxa. It also creates `net4.pdf` and `net3.pdf`, showing the 4-taxon and 6-taxon networks respectively.
  • `validation_qCF.Rmd`: R code to create Fig. 2, taking as input the CSV files from above.

Fig. 3: level-2 network

  • `fig_level2_network.jl`: Julia code to create Fig. 3, showing the level-2 network that was used to validate the distribution of pairwise distances, using either rho=0 or rho=1 (independent or common inheritance). Output: file `fig_level2net.pdf`.
  • `ntwk_level_2.tre`: file containing the Newick description of that network, which can be visualized with the Julia package PhyloPlots.

Figure 4 and supplementary figures: validation of pairwise distances

Figure S2, on a 4-taxon species tree:

  • `gtrees_4tax-changing_PhyloNetworks.tre` contains the 10,000 simulated gene trees.
  • `validation_distances.Rmd`: R code to create Fig. S2, taking as input the gene trees in `gtrees_4tax-changing_PhyloNetworks.tre`.

Figure 4 and supplementary figures S3-S5, on a 6-taxon network with 2 reticulations:

  • Folders `validation_distances_level2net_rho0` and `validation_distances_level2net_rho1` contain the input files for the figures, as compressed `.RData` files:
    • The `samp_big*_d_xy.RData` files contain the pairwise distances between taxa x and y from the 100,000 simulated gene trees (used for the histograms).
    • The `d_*.RData` files contain pairwise distances drawn from their theoretical distributions, summarized by their frequencies in 100,000 small bins (used for the theoretical density curves).
    • The `sampleMeans*_dxy.Rdata` files contain, for taxa x and y, the means (over 100 replicates) of the 1000 ordered distances (dbar_i in the paper), obtained from 1000 simulated gene trees in each replicate (used for the QQ plots).
    • `Dmatrix.Rdata` contains the 6×6 matrix of *minimum* pairwise distances between all pairs of taxa on the network.
  • `validation_distances_level2net_figure.R`: R code to create Fig. 4 and supplementary figures, for the distances from the 6-taxon level-2 network in Fig. 3. Takes as input the files in the folders above. Output: files `fig_pairwisedist_level2net_rho*.pdf`.
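The ordered-distance means stored in the `sampleMeans*` files (dbar_i) can be illustrated with a short sketch. This is not the code used to produce the archived files (which is in Julia and R). The function name `mean_order_statistics` and the plain-list input are assumptions made here: each replicate is a list of distances between a fixed pair of taxa, one per simulated gene tree.

```python
def mean_order_statistics(replicates):
    """Average the i-th smallest distance across replicates.

    replicates: list of equal-length lists of pairwise distances, one list
    per replicate (hypothetical input layout for this sketch).
    Returns a list whose i-th entry is the mean of the i-th order statistic.
    """
    if not replicates:
        raise ValueError("need at least one replicate")
    n = len(replicates[0])
    if any(len(rep) != n for rep in replicates):
        raise ValueError("all replicates must have the same length")
    # Sort within each replicate, then average position by position.
    sorted_reps = [sorted(rep) for rep in replicates]
    return [sum(rep[i] for rep in sorted_reps) / len(sorted_reps)
            for i in range(n)]
```

With 100 replicates of 1000 distances each, the result would have 1000 entries, which can then be compared against theoretical quantiles in a QQ plot.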

Software archive

  • `PhyloCoalSimulations-code-1d266fd.zip`: archive of the PhyloCoalSimulations package from GitHub, main branch (for the code), at commit 1d266fd, which is 1 commit ahead of version v0.1.2.
  • `PhyloCoalSimulations-documentation-1d266fd.zip`: archive of the PhyloCoalSimulations package's documentation, from the gh-pages branch, at commit 1d266fd.

Funding provided by: National Science Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000001
Award Numbers: 1902892, 2023239, 2051760

Files (2.1 MB)

  • `supplementarymaterial.pdf`: 2.1 MB, md5:4822cc02544720cc31ef578e19952f49
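A downloaded file can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and the sketch assumes the checksum is given as a hexadecimal string, optionally prefixed with `md5:`.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5("supplementarymaterial.pdf", "md5:4822cc02544720cc31ef578e19952f49")` should return `True` for an intact download, assuming the listed checksum refers to that file.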

Additional details

Related works

Is derived from
10.5061/dryad.02v6wwq6x (DOI)