Microbial Recombination with Population Structure

Published January 6, 2021 | Version v1

Dataset Open

This collection of data files contains, for each bacterial species:

All raw genome sequence files.
The core genome alignment obtained with REALPHY (this is the file with the .phy extension).
A file with all SNP columns in the core genome alignment (the file name starts with columns_).
A file listing all SNP types sorted from most to least common (the file name starts with snp_stats_).
Two files containing the results of the pairwise analysis: a file with, for each pair of strains, the histogram of SNP counts per alignment block (the file name ends in _histograms), and a file with the results of the mixture modeling (the file with the .pkl extension).

In addition, for M. tuberculosis there is a subfolder with information about which strains have since been retracted from the database.

The formats of these files are as follows:

The raw genome sequence files are in FASTA format (.fasta or .fna).
The core genome alignment is in PHYLIP multiple alignment format (.phy).
The snp_stats file starts with a header line listing the total number of columns in the alignment with 1, 2, 3, and 4 different nucleotides. Each subsequent line in the file corresponds to an observed SNP type, sorted from most to least common. Each SNP line has the following columns:

The total amount of genomic DNA associated with these SNP columns (associating each conserved alignment column to its closest SNP).
The total number of occurrences of this SNP type.
The number of strains sharing the minority allele.
A bit-pattern describing the SNP type, with 1 for the strains sharing the minority allele and 0 for the others. The strains are sorted in the same order as in the PHYLIP alignment file.
A list of all the strains sharing the minority allele.
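
As an illustration of how the bit-pattern relates to the list of strains, here is a minimal Python sketch. It assumes the bit-pattern is stored as a string of '0' and '1' characters, one per strain; the function name minority_strains and the strain names are hypothetical and not taken from the data files.

```python
def minority_strains(bit_pattern, strain_order):
    """Return the strains flagged with '1' in a SNP bit-pattern.

    Assumption: bit_pattern is a string of '0'/'1' characters and
    strain_order lists the strains in PHYLIP alignment order.
    """
    if len(bit_pattern) != len(strain_order):
        raise ValueError("bit-pattern length must equal the number of strains")
    return [name for bit, name in zip(bit_pattern, strain_order) if bit == "1"]


# Hypothetical strain names, in the same order as the alignment rows.
strains = ["strainA", "strainB", "strainC", "strainD"]
sharing = minority_strains("0110", strains)
print(len(sharing), sharing)
```

The length of the returned list corresponds to the "number of strains sharing the minority allele" column, and the list itself to the final column.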

The columns_ file has one SNP per line, giving the position in the alignment plus the bit-pattern describing the SNP.
The _histograms file contains, for each pair of strains, a histogram counting the number of 1 kb blocks with 0, 1, 2, etc. SNPs. Note that these counts come from 1 kb sliding windows along the core genome alignment, with the window moved by 100 bases at a time, so that an alignment column will typically occur in 10 blocks.
The pickle file (.pkl) contains, for each pair, the results of the mixture modeling. Each line corresponds to a pair and has these fields: [spec1, spec2, div, Lpois, r_nomix, Lmix, rho, r, a, lam, mutpois, mutrecomb, cut] which correspond to:

Name of strain 1.
Name of strain 2.
Their overall nucleotide divergence.
The log-likelihood under a model assuming SNP counts form a simple Poisson distribution.
The parameter of this fitted Poisson distribution.
The log-likelihood of the mixture of a Poisson and a negative binomial distribution.
The fraction rho assigned to the Poisson part of the mixture.
The parameter of the Poisson component.
The exponent a of the negative binomial component.
The second parameter (lambda) of the negative binomial.
The estimated total number of mutations in the Poisson component.
The estimated total number of mutations in the negative binomial component.
The value at which the likelihood of the Poisson component starts exceeding the likelihood of the negative binomial component.
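
To make the pairwise analysis more concrete, the following Python sketch builds a sliding-window SNP histogram for one pair of aligned sequences and fits the simple Poisson model to it. This is not the code used to produce the files. The function names are hypothetical, every mismatching column (including gap versus base) is counted as a SNP, trailing partial windows are dropped, and the overlapping windows are treated as independent observations; all of these are assumptions.

```python
import math
from collections import Counter


def window_snp_histogram(seq1, seq2, window=1000, step=100):
    """Histogram of SNP counts per sliding window for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    # Prefix sums of mismatches make each window count an O(1) lookup.
    prefix = [0]
    for a, b in zip(seq1, seq2):
        prefix.append(prefix[-1] + (1 if a != b else 0))
    hist = Counter()
    for start in range(0, len(seq1) - window + 1, step):
        hist[prefix[start + window] - prefix[start]] += 1
    return hist


def poisson_fit(hist):
    """Maximum-likelihood Poisson rate and log-likelihood for a count histogram."""
    n_blocks = sum(hist.values())
    rate = sum(k * n for k, n in hist.items()) / n_blocks
    if rate == 0:
        # All windows have zero SNPs: probability 1 under a rate-0 Poisson.
        return 0.0, 0.0
    loglik = sum(
        n * (k * math.log(rate) - rate - math.lgamma(k + 1))
        for k, n in hist.items()
    )
    return rate, loglik


# Toy pair: 2000 aligned columns with two differing positions (50 and 1500).
seq1 = "A" * 2000
seq2 = "A" * 50 + "C" + "A" * 1449 + "G" + "A" * 499
hist = window_snp_histogram(seq1, seq2)
rate, loglik = poisson_fit(hist)
print(dict(hist), rate, loglik)
```

In this sketch, rate and loglik play the roles of the r_nomix and Lpois fields, up to whatever normalization the original analysis applied. The mixture fit itself is not reproduced here because the exact parametrization of the negative binomial component is not specified in this description.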

In addition, for the human data we provide a PHYLIP multiple genome alignment and a file with all SNP columns.

Files

Name	Size
recombination.zip md5:527ba8a330a9724a51d8008eff1052a4	32.8 GB