Data from: Improved inference of site-specific positive selection under a generalized parametric codon model when there are multinucleotide mutations and multiple nonsynonymous rates

Dunn, Katherine A.; Kenney, Toby; Gu, Hong; Bielawski, Joseph P.

doi:10.5061/dryad.m4dr156

Published December 14, 2018 | Version v1

Dataset Open

1. Dalhousie University

Background: An excess of nonsynonymous substitutions, over neutrality, is considered evidence of positive Darwinian selection. Inference for proteins often relies on estimation of the nonsynonymous to synonymous ratio (ω=dN/dS) within a codon model. However, to ease computational difficulties, ω is typically estimated assuming an idealized substitution process where (i) all nonsynonymous substitutions have the same rate (regardless of impact on organism fitness) and (ii) instantaneous double and triple (DT) nucleotide mutations have zero probability (despite evidence that they can occur). It follows that estimates of ω represent an imperfect summary of the intensity of selection, and that tests based on the ω>1 threshold could be negatively impacted.

Results: We developed a general-purpose parametric (GPP) modelling framework for codons. This novel approach allows specification of all possible instantaneous codon substitutions, including multiple nonsynonymous rates (MNRs) and instantaneous DT nucleotide changes. Existing codon models are specified as special cases of the GPP model. We use GPP models to implement likelihood ratio tests (LRTs) for ω>1 that accommodate MNRs and DT mutations. Through both simulation and real data analysis, we find that failure to model MNRs and DT mutations reduces power in some cases and inflates false positives in others. False positives under traditional M2a and M8 models were very sensitive to DT changes. This was exacerbated by the choice of frequency parameterization (GY vs. MG), with false positive rates sometimes >90% under MG. By including MNRs and DT mutations, accuracy and power were greatly improved under the GPP framework. However, we also find that over-parameterized models can perform less well, and this can contribute to degraded performance of LRTs.

Conclusions: We suggest GPP models should be used alongside traditional codon models. Further, all codon models should be deployed within an experimental design that includes (i) assessing robustness to model assumptions, and (ii) investigation of non-standard behaviour of MLEs. As the goal of every analysis is to avoid false conclusions, more work is needed on model selection methods that consider both the increase in fit engendered by a model parameter and the degree to which that parameter is affected by un-modelled evolutionary processes.
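The likelihood ratio tests mentioned in the abstract compare the maximized log-likelihood of a null model (no ω>1 class) with that of an alternative model that permits ω>1. The sketch below shows only the arithmetic of such a test. It is not the authors' implementation: the log-likelihood values are made up for illustration, and the chi-square reference distribution with 2 degrees of freedom is an assumption (it is the conventional choice for M1a vs. M2a style comparisons, and may not be appropriate when MLEs behave non-standardly, as the abstract cautions).

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Return the LRT statistic 2 * (lnL_alt - lnL_null), floored at zero.

    A negative difference can only arise from incomplete optimization of the
    alternative model, so it is treated as zero here.
    """
    return max(0.0, 2.0 * (lnl_alt - lnl_null))


def chi2_sf_df2(stat: float) -> float:
    """Upper-tail probability of a chi-square distribution with 2 df.

    For 2 degrees of freedom the survival function has the closed form
    exp(-x / 2), so no external statistics library is needed.
    """
    return math.exp(-stat / 2.0)


# Hypothetical log-likelihoods (illustration only, not values from the dataset).
lnl_null = -2345.67  # e.g. a model without a class of sites with omega > 1
lnl_alt = -2340.12   # e.g. a model with an extra class of sites with omega > 1

stat = lrt_statistic(lnl_null, lnl_alt)
p_value = chi2_sf_df2(stat)  # ASSUMPTION: chi-square(2) reference distribution
print(f"2*delta(lnL) = {stat:.2f}, p = {p_value:.4g}")
```

The same two functions can be applied to any pair of nested models whose comparison uses 2 degrees of freedom; other degrees of freedom would need a general chi-square survival function.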

Notes

Sequence data for simulation studies

Multi-sequence alignments for simulation study 1 (cases 1a-1c), simulation study 2 (cases 2a-2h), and simulation study 3 (cases 3a-3c).

Simulation_sequence_data.tar.gz

Newick tree files for simulation studies

Newick-formatted (parenthetical notation) tree files, including branch lengths, used in each simulation study. Branch lengths are scaled as the mean number of substitutions per codon.

Simulation_trees.txt
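Because the trees are stored as Newick strings with branch lengths, the total tree length (in substitutions per codon) can be recovered by summing the numbers that follow each colon. The following is a minimal sketch under stated assumptions: the example tree is invented, the exact layout of `Simulation_trees.txt` (for example, one tree per line) is not documented here, and the regular expression ignores the possibility of colons inside quoted taxon labels.

```python
import re

# A branch length is a colon followed by a decimal or scientific-notation number.
# ASSUMPTION: taxon labels contain no colons (no quoted labels).
BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def tree_length(newick: str) -> float:
    """Sum all branch lengths found in a single Newick string."""
    return sum(float(value) for value in BRANCH_LENGTH.findall(newick))


# Invented four-taxon tree, for illustration only.
example = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);"
print(f"tree length = {tree_length(example):.3f} substitutions per codon")
```

To apply this to the deposited file, each Newick string would first need to be read out of `Simulation_trees.txt`; a dedicated phylogenetics library is the safer choice if the files contain comments or quoted labels.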

COLD program commands for simulation studies

Commands required to generate the sequence data of each simulation scenario using the COLD program.

COLD_simulation_scripts.tar.gz

Names of genes in the real data analysis

Full names and ID numbers for the 24 Streptococcus transmembrane proteins analyzed.

Streptococcus_gene_names.txt

Streptococcus gene sequence data

Multi-sequence alignments for the 24 Streptococcus transmembrane proteins. Ambiguous alignment positions are removed.

Streptococcus_gene_sequences.txt

Gene trees for each Streptococcus gene

Newick-formatted (parenthetical notation) tree files, including branch lengths, for each of the 24 Streptococcus transmembrane proteins analyzed. Branch lengths are scaled as the mean number of substitutions per codon.

Streptococcus_trees.txt

Files

Files (9.8 MB)

Name	MD5	Size
COLD_simulation_scripts.tar.gz	b63ba7be38bf05af3b57c1d1613b6d14	6.5 kB
Simulation_sequence_data.tar.gz	dae93473487c01bd5677a7b02f524aea	9.6 MB
Simulation_trees.txt	3607011c37d4e377646d2ba282728d63	785 Bytes
Streptococcus_gene_names.txt	781c74093cc18689c51431d34deb605d	1.3 kB
Streptococcus_gene_sequences.txt	6173313eeb695f431cc26f464263fda9	243.0 kB
Streptococcus_trees.txt	76f3696851e3e8cf819aad9c1f7a1e4f	10.7 kB
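
The MD5 checksums listed for each file can be used to confirm that a download is intact. The sketch below assumes the six files have been saved under their listed names in a single directory; the function names `md5_of` and `verify` are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "COLD_simulation_scripts.tar.gz": "b63ba7be38bf05af3b57c1d1613b6d14",
    "Simulation_sequence_data.tar.gz": "dae93473487c01bd5677a7b02f524aea",
    "Simulation_trees.txt": "3607011c37d4e377646d2ba282728d63",
    "Streptococcus_gene_names.txt": "781c74093cc18689c51431d34deb605d",
    "Streptococcus_gene_sequences.txt": "6173313eeb695f431cc26f464263fda9",
    "Streptococcus_trees.txt": "76f3696851e3e8cf819aad9c1f7a1e4f",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    # ASSUMPTION: the files were downloaded into the current directory.
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, status in verify(".").items():
        print(f"{name}: {labels[status]}")
```

A mismatch usually indicates a truncated or corrupted download, in which case the file should be fetched again.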

Additional details

Is cited by: 10.1186/s12862-018-1326-7 (DOI)

