Published August 4, 2017 | Version v1
Dataset Open

Data from: Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation

  • 1. Dalhousie University
  • 2. University of Vienna

Description

Proteins have distinct structural and functional constraints at different sites that lead to site-specific preferences for particular amino acid residues as the sequences evolve. Heterogeneity in the amino acid substitution process between sites is not modeled by commonly used empirical amino acid exchange matrices. Such model misspecification can lead to artefacts in phylogenetic estimation such as long-branch attraction. Although sophisticated site-heterogeneous mixture models have been developed to address this problem in both Bayesian and maximum likelihood (ML) frameworks, their formidable computational time and memory usage severely limit their use in large phylogenomic analyses. Here we propose a posterior mean site frequency (PMSF) method as a rapid and efficient approximation to full empirical profile mixture models for ML analysis. The PMSF approach assigns to each site a conditional mean amino acid frequency profile, calculated from a mixture model fitted to the data using a preliminary guide tree. These PMSF profiles can then be used for in-depth tree-searching in place of the full mixture model. Compared with widely used empirical mixture models with k classes, our implementation of PMSF in IQ-TREE (http://www.iqtree.org) speeds up the computation by approximately k/1.5-fold and requires a small fraction of the RAM. Furthermore, this speedup allows, for the first time, full nonparametric bootstrap analyses to be conducted under complex site-heterogeneous models on large concatenated data matrices. Our simulations and empirical data analyses demonstrate that PMSF can effectively ameliorate long-branch attraction artefacts. In some empirical and simulation settings PMSF provided more accurate estimates of phylogenies than the mixture models from which they derive.
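
As an illustration of the "conditional mean amino acid frequency profile" described above, the following is a minimal sketch, not the IQ-TREE implementation. It assumes that the per-site likelihood under each mixture class has already been computed on the guide tree, and it takes the site profile to be the posterior-weighted average of the class profiles. The function names, the two-class mixture, and the three-state alphabet are hypothetical and chosen only to keep the example small.

```python
def posterior_class_weights(weights, site_likelihoods):
    """Posterior probability of each mixture class at one site.

    weights: prior mixture weights, one per class (assumed to sum to 1).
    site_likelihoods: likelihood of the site under each class, assumed to
        have been computed beforehand on the guide tree.
    """
    joint = [w * lik for w, lik in zip(weights, site_likelihoods)]
    total = sum(joint)
    if total <= 0:
        raise ValueError("site has zero likelihood under every class")
    return [j / total for j in joint]


def pmsf_profile(weights, site_likelihoods, class_profiles):
    """Posterior mean frequency profile for one site.

    class_profiles: one frequency vector per class, all of equal length.
    Returns the posterior-weighted average of the class profiles.
    """
    post = posterior_class_weights(weights, site_likelihoods)
    n_states = len(class_profiles[0])
    return [
        sum(p * profile[state] for p, profile in zip(post, class_profiles))
        for state in range(n_states)
    ]


# Toy example (hypothetical numbers): 2 classes, 3 states instead of 20.
toy_weights = [0.5, 0.5]
toy_likelihoods = [0.3, 0.1]
toy_profiles = [[0.8, 0.1, 0.1], [0.2, 0.4, 0.4]]
toy_site_profile = pmsf_profile(toy_weights, toy_likelihoods, toy_profiles)
# Approximately [0.65, 0.175, 0.175]; the entries still sum to 1.
```

Once each site has a single fixed profile, the tree search no longer has to evaluate every mixture class at every site, which is consistent with the speedup stated above: taking k/1.5 at face value, a 20-class mixture would correspond to roughly 13-fold and a 60-class mixture to roughly 40-fold.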

Files

PMSF.Sup.Materials.2.pdf

Files (1.9 GB)

md5                                 Size
71c53012220b792d8e499e38123be250    142.0 kB
23ea973fc91874a3c6e8a5915719d15e    2.1 MB
e7a163520dc580818d56c5e94b6bfa47    7.6 MB
3616821a08ea541df3bfde99c4f3350a    281.1 kB
4eb041e015a300e3414c558f46051fa8    18.6 MB
1ee315e6a0150e28866cdc0abf69898c    397.7 MB
d9efab8991baa2b0f9bb3a5acf17363a    394.5 MB
e0a145b073f8a7098ca0cbd2d7dc5bad    110.4 MB
0e8ae6cceed017321ad1bc32faa6555b    18.6 MB
82b81a4b2ff30b807c0b33d6fd95974e    8.6 MB
d046702f6adec8cb11b6bff95caa296b    11.1 MB
a74becb2bc61f54a37d1c51c4a31432b    13.6 MB
f69c91d9f5617246a0702bde09ba5d34    6.1 MB
70316ac89de41533ad567caa9c0b8b79    18.3 MB
ac13f6d1f7ba92d3d86e54bb608c8cda    361.1 MB
5a724dcd7b6f40e7aa82645a8901fa53    358.6 MB
d25b4cb9ae527ff5d2b238fb2eb8830d    108.4 MB
533ecc80a46cbdb0a9e220333521d972    18.3 MB
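
The md5 values listed above can be used to check a downloaded file for corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the example path are hypothetical, since the file names are not shown in this listing.

```python
import hashlib


def md5_hex(chunks):
    """Return the hexadecimal md5 digest of an iterable of byte strings."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_chunks(path, chunk_size=1 << 20):
    """Yield a file's contents in blocks so large files are not held in memory."""
    with open(path, "rb") as handle:
        while True:
            block = handle.read(chunk_size)
            if not block:
                break
            yield block


def matches_listed_md5(path, listed):
    """Compare a file's md5 with a listed value such as 'md5:71c5...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_hex(read_chunks(path)) == expected
```

For example, `matches_listed_md5("downloaded_file", "md5:71c53012220b792d8e499e38123be250")` would return `True` only if the local file (a placeholder path here) hashes to the first value in the list.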

Additional details

Related works

Is cited by
10.1093/sysbio/syx068 (DOI)