Published November 21, 2022 | Version v1
Dataset Open

Expectation-Maximization enables phylogenetic dating under a Categorical Rate Model

  • 1. University of California, San Diego

Description

Dating phylogenetic trees to obtain branch lengths in units of time is essential for many downstream applications but remains challenging. Dating requires inferring mutation rates that can change across the tree. While information about a small subset of nodes is typically available from the fossil record or from sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a clock model that defines a distribution over rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades under model misspecification, where the assumed parametric clock model differs vastly from the true distribution. Notably, existing methods tend to assume rigid, often unimodal rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates and often leads to difficult non-convex optimization problems. To tackle these two challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization (EM) algorithm to co-estimate the rate categories and the branch lengths in units of time. Our model makes fewer assumptions about the true clock model than parametric models such as the Gamma or LogNormal distributions. Our results on two simulated and real datasets (Angiosperms and HIV) and a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with nonmodal or multimodal clock models.
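
To make the idea of EM over k rate categories more concrete, the following is a minimal toy sketch and not the MD-Cat implementation. It rests on several assumptions that are not stated in the description above: branch times are held fixed (MD-Cat co-estimates them under constraints), the k categories have equal weight, and the number of substitutions on a branch is Poisson-distributed with mean rate × time. The function name `em_rate_categories` and its parameters are hypothetical.

```python
import math


def em_rate_categories(counts, times, init_rates, n_iter=50):
    """Toy EM for k equally weighted rate categories (illustrative only).

    counts     -- substitutions observed on each branch (assumed Poisson)
    times      -- branch durations in time units, held fixed in this sketch
    init_rates -- starting values for the k category rates
    Returns the fitted rates and the per-branch category posteriors.
    """
    rates = list(init_rates)
    k = len(rates)
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of each category for each branch.
        post = []
        for c, t in zip(counts, times):
            # Poisson log-likelihood, dropping the term that does not
            # depend on the category (log c!).
            logs = [c * math.log(r * t) - r * t for r in rates]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            post.append([w / total for w in weights])
        # M-step: weighted Poisson MLE of each category rate.
        for j in range(k):
            num = sum(p[j] * c for p, c in zip(post, counts))
            den = sum(p[j] * t for p, t in zip(post, times))
            if den > 0:
                # Keep rates strictly positive so the log stays defined.
                rates[j] = max(num / den, 1e-12)
    return rates, post
```

Calling `em_rate_categories(counts, times, init_rates)` with per-branch counts and durations returns one rate per category plus a posterior row per branch. In a full dating method the times would themselves be unknowns optimized under the calibration constraints, which this sketch deliberately leaves out.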

Notes

Funding provided by: National Institutes of Health
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000002
Award Number: 1R35GM142725

Files

README.md

Files (198.1 MB)

MD5 checksum                          Size
md5:9283a8c329cd9b6907db52933e7e29fa  4.5 kB
md5:aae8fe5fa01da83de2d2f45d66f55548  1.8 kB
md5:47e88ccb516595f072344f05a56c1cf8  8.2 kB
md5:7ef96b38002851aaf77a045caafdb999  6.7 kB
md5:d051815c4c310f363cb4faafb9c92cbb  6.7 kB
md5:2bc4f1c1d90fd7812c6fb74ac46ca8fe  744.6 kB
md5:6274e61477783133fca9f3e25e5b4daa  36.7 kB
md5:5323a2db1cfa1b2d5300da7feaeb3adf  61.8 kB
md5:fffb74a1b323e17919252a1268ae051f  102.4 kB
md5:36e2a55b4a78d4617a0317d752e48054  3.0 MB
md5:c1b5ef2ab2eb440be693212f0024452f  30.9 MB
md5:d254bf6d536bdb3cc03b6fb5722d7966  941.3 kB
md5:b091fdab384b75842863684932dfd6d8  827.3 kB
md5:46a0f025f23dab05c221121716130622  2.2 MB
md5:aa214bcbabb2639c40562689f9215651  24.5 MB
md5:be4284af57b244deade79620ede40f29  5.3 MB
md5:bd8a8e04026f550e4548053c07739d04  2.6 MB
md5:4d2a1b5690e2d655775007712ef0fdb2  2.8 MB
md5:5296a45ddb6c7c30afd21bbd2b615334  7.4 MB
md5:39ace72c15a75d0b03154757c2119bbd  38.7 MB
md5:62a0170a37793836b3ac22fdaff92187  5.3 MB
md5:25d7ce7cd259913095aec70a096d65c0  2.6 MB
md5:dca9f1f9fb01fe702d9ad79234c59e51  2.8 MB
md5:a122829a396dd62e5df448642a919d55  7.4 MB
md5:38a52627c6a70ea87815c0c917ed2b58  36.0 MB
md5:aaa2cea7b572f89331ebb9afaf453e1d  5.2 MB
md5:1e3a19ffa5e6e187496ab0678375db11  2.6 MB
md5:5695e8e4ab910cb1367c4e95f0becc9f  8.7 MB
md5:512f4c32239c2c3020734525f7015c6e  7.4 MB
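
The `md5:` values listed above can be used to confirm that a downloaded file arrived intact. The sketch below shows one way to do that with Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_listing` are hypothetical, and because the file names are not shown in the listing, the path you pass in is whatever name the file has on your disk.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a local file against a listed value such as 'md5:9283a8c3...'."""
    # Accept the value with or without the 'md5:' prefix.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("README.md", "md5:<value from the listing>")` returns `True` only if the local copy hashes to the listed value. Which listed checksum belongs to which file has to be read from the record page itself.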

Additional details

Related works