There is a newer version of the record available.

Published April 10, 2023 | Version 1.0.0
Dataset Open

CherryML: Scalable Maximum Likelihood Estimation of Phylogenetic Models

  • 1. University of California, Berkeley

Description

Simulated datasets used in our paper "CherryML: Scalable Maximum Likelihood Estimation of Phylogenetic Models" to produce figures 1bc, 1d, and 2ab. The data provided in each folder is as follows:

  • rate_matrices contains the classical LG rate matrix, and our 400 x 400 estimated co-evolutionary model Q2.
  • fig_1bc contains the simulated data used to estimate and evaluate rate matrices using the CherryML method and EM (with XRATE) as shown in Fig. 1b and c of our paper. The files and sub-directories here are:
    • fig_1bc_simulated_data_families_all.txt contains the list of protein family names used to train the model. When only K families are used in Fig. 1b and c, these are the first K families of this list.
    • gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
    • msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree, without site rate variation.
    • gt_site_rates_dir contains the site rates used. In this case, they are all 1.
    • gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant here but is provided for completeness; you can safely ignore this directory.
  • fig_1d contains the simulated data used to evaluate the effect of time quantization on the CherryML method as shown in Fig. 1d of our paper. The files and sub-directories here are:
    • gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
    • msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree, with site rate variation.
    • gt_site_rates_dir contains the site rates used.
    • gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant here but is provided for completeness; you can safely ignore this directory.
  • fig_2ab contains the simulated data used for the co-evolutionary model experiments shown in Fig. 2a and b of our paper. The files and sub-directories here are:
    • gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
    • msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree for the non-contacting sites and Q2 for the contacting sites, all without site rate variation.
    • gt_site_rates_dir contains the site rates used, in this case all 1 (i.e. no site rate variation).
    • gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant here but is provided for completeness; you can safely ignore this directory.
    • contact_map_dir contains the simulated contact maps for each family. These were obtained by computing a maximal matching on the true contact maps derived from the trRosetta paper, as described in detail in our paper.
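
To illustrate what a maximal matching on a contact map means, here is a minimal sketch. It assumes the contacts are given as a list of site-index pairs and uses a simple greedy pass; the function name and the toy contact list are hypothetical, and the procedure actually used to build contact_map_dir may differ (see the paper and repository for the exact method).

```python
def greedy_maximal_matching(contacts):
    """Greedily pick contact pairs so that no site is used twice.

    The result is maximal: no remaining pair from `contacts` can be added
    without reusing a site. It is not necessarily a maximum-size matching.
    """
    matched_sites = set()
    matching = []
    for i, j in contacts:
        if i != j and i not in matched_sites and j not in matched_sites:
            matching.append((i, j))
            matched_sites.update((i, j))
    return matching


# Hypothetical toy contact list (site indices), not taken from the dataset.
toy_contacts = [(0, 5), (0, 7), (5, 9), (2, 7), (3, 4)]
print(greedy_maximal_matching(toy_contacts))  # [(0, 5), (2, 7), (3, 4)]
```

After matching, every site belongs to at most one contacting pair, so each pair can be treated as a single unit with its own 400 x 400 rate matrix while the unmatched sites are treated individually.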

The exact end-to-end code that generates these simulated datasets is provided in our GitHub repository: https://github.com/songlab-cal/CherryML

In fact, by default, when you try to reproduce the figures in our paper by running the `reproduce_all_figures.py` script in our repository, the data will automatically be simulated for you if it isn't already present. This can be bypassed by downloading the data from this Zenodo record and changing the top of `reproduce_all_figures.py` to point to these files.
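
For intuition about what "running a rate matrix down a tree" involves, the following is a rough, standard-library-only sketch of simulating a continuous-time Markov chain along the branches of a tree, with an optional per-site rate. It is not the repository's implementation: the function names, the nested-tuple tree format, the 3-state toy rate matrix, and the uniformly drawn root sequence are all assumptions made for illustration (a real simulation would use the 20-state LG matrix and typically draw the root from the stationary distribution).

```python
import random


def evolve_state(state, branch_length, rate_matrix, site_rate=1.0, rng=random):
    """Evolve one site along one branch.

    rate_matrix[i][j] is the jump rate from state i to state j (i != j).
    Diagonal entries are ignored. The site rate rescales the branch length.
    """
    remaining = branch_length * site_rate
    while remaining > 0.0:
        rates = [r if j != state else 0.0
                 for j, r in enumerate(rate_matrix[state])]
        total = sum(rates)
        if total <= 0.0:
            break  # absorbing state: nothing can happen
        wait = rng.expovariate(total)  # time until the next jump
        if wait >= remaining:
            break  # the branch ends before the next jump
        remaining -= wait
        state = rng.choices(range(len(rates)), weights=rates)[0]
    return state


def simulate_down_tree(node, parent_seq, rate_matrix, site_rates, rng, out):
    """node is (name, branch_length, children); fills out[name] per node."""
    name, branch_length, children = node
    seq = [evolve_state(s, branch_length, rate_matrix, r, rng)
           for s, r in zip(parent_seq, site_rates)]
    out[name] = seq
    for child in children:
        simulate_down_tree(child, seq, rate_matrix, site_rates, rng, out)
    return out


# Hypothetical 3-state toy model and tree, not taken from the dataset.
toy_q = [[-1.0, 0.5, 0.5],
         [0.3, -0.6, 0.3],
         [0.2, 0.2, -0.4]]
toy_tree = ("root", 0.0, [("A", 0.3, []), ("B", 0.5, [])])
rng = random.Random(0)
num_sites = 10
root_seq = [rng.randrange(3) for _ in range(num_sites)]
site_rates = [1.0] * num_sites  # all 1 means no site rate variation
sequences = simulate_down_tree(toy_tree, root_seq, toy_q, site_rates, rng, {})
print(sorted(sequences))
```

Setting every entry of `site_rates` to 1 corresponds to the "without site rate variation" case described above; using different values per site corresponds to simulating with site rate variation.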

Files (4.8 GB)

  • md5:8fc4f12349085b5c979fcf6df18b3e6b (30.5 MB)
  • md5:0df1e9b2cebe23e35a9fb971656a2c70 (2.4 GB)
  • md5:b254c7f9b47b7b6ebae0150c1b549511 (2.4 GB)
  • md5:2133ac0985c6c6ca570a5f748ef02fb2 (829.5 kB)
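
The md5 values listed above can be used to check that a download completed without corruption. Below is a minimal sketch using Python's standard `hashlib` module; the helper names are hypothetical, and the path in the usage comment is a placeholder because the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks so that
    multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (placeholder path; substitute the file you actually downloaded):
# matches_listing("path/to/downloaded_file", "md5:8fc4f12349085b5c979fcf6df18b3e6b")
```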