CherryML: Scalable Maximum Likelihood Estimation of Phylogenetic Models

Sebastian Prillo; Yun Deng; Pierre Boyeau; Xingyu Li; Po-Yen Chen; Yun S. Song

doi:10.5281/zenodo.7814723

Published April 10, 2023 | Version 1.0.0

Dataset Open

1. University of California, Berkeley

Simulated datasets used in our paper "CherryML: Scalable Maximum Likelihood Estimation of Phylogenetic Models" to produce figures 1bc, 1d, and 2ab. The data provided in each folder is as follows:

rate_matrices contains the classical LG rate matrix, and our 400 x 400 estimated co-evolutionary model Q2.
fig_1bc contains the simulated data used to estimate and evaluate rate matrices using the CherryML method and EM (with XRATE) as shown in Fig. 1b and c of our paper. The files and sub-directories here are:
- fig_1bc_simulated_data_families_all.txt contains the list of protein family names used to train the model. When only K families are used in Fig. 1b and c, these are the first K families of this list.
- gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
- msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated running the LG rate matrix down each tree, without site rate variation.
- gt_site_rates_dir contains the site rates used. In this case, they are all 1.
- gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant for our purposes but is provided for completeness; you can safely ignore this directory.
fig_1d contains the simulated data used to evaluate the effect of time quantization on the CherryML method as shown in Fig. 1d of our paper. The files and sub-directories here are:
- gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
- msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated running the LG rate matrix down each tree, with site rate variation.
- gt_site_rates_dir contains the site rates used.
- gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant for our purposes but is provided for completeness; you can safely ignore this directory.
fig_2ab contains the simulated data used to estimate and evaluate the co-evolutionary model as shown in Fig. 2a and b of our paper. The files and sub-directories here are:
- gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
- msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated running the LG rate matrix down each tree for the non-contacting sites, and using Q2 for the contacting sites, all without site rate variation.
- gt_site_rates_dir contains the site rates used, in this case all 1 (i.e. no site rate variation).
- gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant for our purposes but is provided for completeness; you can safely ignore this directory.
- contact_map_dir contains the simulated contact maps for each family. These were obtained by computing a maximal matching on the true contact maps derived from the trRosetta paper, as described in detail in our paper.
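To illustrate what a maximal matching on a contact map means, here is a minimal sketch in Python. It uses a simple greedy pass, which is one standard way to obtain a maximal matching; it is not taken from our repository and may differ from the procedure actually used to build contact_map_dir. The function name `greedy_maximal_matching` and the toy contact list are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def greedy_maximal_matching(
    contacts: Iterable[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Greedily pick contacting pairs so that each site is used at most once.

    The result is maximal: every contact that was not selected shares a site
    with some selected pair. It is not guaranteed to be a maximum matching.
    """
    used: Set[int] = set()
    matching: List[Tuple[int, int]] = []
    for i, j in contacts:
        # Skip self-contacts and pairs that reuse an already matched site.
        if i == j or i in used or j in used:
            continue
        matching.append((i, j))
        used.update((i, j))
    return matching


# Toy contact list over sites 0..4 (illustrative only, not from the dataset).
toy_contacts = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
toy_pairs = greedy_maximal_matching(toy_contacts)
print(toy_pairs)
```

In a matching, every site belongs to at most one contacting pair, so each selected pair can be treated as one joint state of a 400 x 400 model such as Q2 while unmatched sites evolve on their own.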

The exact end-to-end code which generates these simulated datasets is provided in our GitHub repository: https://github.com/songlab-cal/CherryML

In fact, by default, when you try to reproduce the figures in our paper by running the `reproduce_all_figures.py` script in our repository, the data will automatically be simulated for you if it isn't already present. This can be bypassed by downloading the data here in Zenodo and changing the top of `reproduce_all_figures.py` to point to these files.
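Once the archives are downloaded and extracted, the "first K families" convention used for fig_1bc can be sketched as follows. This is an illustration only: it assumes that fig_1bc_simulated_data_families_all.txt lists one family name per line, which is not stated above, and the helper name `first_k_families` and the demo family names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def first_k_families(list_path: Path, k: int) -> List[str]:
    """Return the first k family names from a families list file.

    Assumption: the file holds one family name per line; blank lines
    are ignored.
    """
    lines = list_path.read_text().splitlines()
    names = [line.strip() for line in lines if line.strip()]
    return names[:k]


# Demo on a throwaway file with made-up names, so the sketch runs
# without the real dataset being present.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "fig_1bc_simulated_data_families_all.txt"
    demo_path.write_text("family_a\nfamily_b\nfamily_c\nfamily_d\n")
    demo_first_two = first_k_families(demo_path, 2)
    print(demo_first_two)
```

With the real data, `list_path` would point at the extracted fig_1bc folder, and the returned names would be used to select the matching entries in gt_tree_dir, msa_dir and gt_site_rates_dir.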

Files (4.8 GB)

Name	MD5	Size
fig_1bc.tgz	8fc4f12349085b5c979fcf6df18b3e6b	30.5 MB
fig_1d.tgz	0df1e9b2cebe23e35a9fb971656a2c70	2.4 GB
fig_2ab.tgz	b254c7f9b47b7b6ebae0150c1b549511	2.4 GB
rate_matrices.tgz	2133ac0985c6c6ca570a5f748ef02fb2	829.5 kB
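After downloading, the archives can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_downloads` and the `downloads` directory are hypothetical and should be adapted to wherever the files were saved.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Checksums copied from the file listing on this page.
EXPECTED_MD5: Dict[str, str] = {
    "fig_1bc.tgz": "8fc4f12349085b5c979fcf6df18b3e6b",
    "fig_1d.tgz": "0df1e9b2cebe23e35a9fb971656a2c70",
    "fig_2ab.tgz": "b254c7f9b47b7b6ebae0150c1b549511",
    "rate_matrices.tgz": "2133ac0985c6c6ca570a5f748ef02fb2",
}


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(download_dir: Path) -> Dict[str, bool]:
    """Map each archive name to True if it exists and its MD5 matches."""
    results: Dict[str, bool] = {}
    for name, expected in EXPECTED_MD5.items():
        archive = download_dir / name
        results[name] = archive.is_file() and md5_of_file(archive) == expected
    return results


if __name__ == "__main__":
    # "downloads" is a placeholder for the actual download location.
    for name, ok in verify_downloads(Path("downloads")).items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use low for the 2.4 GB archives. Archives that pass the check can then be unpacked with `tar -xzf`.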
