Published January 6, 2021 | Version v1
Dataset
Open
Microbial Recombination with Population Structure
Description
This collection of data files contains, for each bacterial species:
- All raw genome sequence files.
- The core genome alignment obtained with REALPHY (this is the file with the .phy extension).
- A file with all SNP columns in the core genome alignment (the file name starts with columns_).
- A file listing all SNP types sorted from most to least common (the file name starts with snp_stats_).
- Two files containing the results of the pairwise analysis: first, a file with, for each pair, the histogram of SNP counts per alignment block (the file name ends in _histograms); second, a file with the results of the mixture modeling (the file with the .pkl extension).
In addition, for M. tuberculosis there is a subfolder with information about which strains have since been retracted from the database.
The formats of these files are as follows:
- The raw genome sequence files are in FASTA format (.fasta or .fna).
- The core genome alignment is in PHYLIP multiple alignment format (.phy).
- The snp_stats file starts with a header line listing the total number of columns in the alignment with 1, 2, 3, and 4 different nucleotides. Each subsequent line corresponds to an observed SNP type, sorted from most to least common. Each SNP line has the following columns:
  - The total amount of genomic DNA associated with these SNP columns (associating each conserved alignment column with its closest SNP).
  - The total number of occurrences of this SNP type.
  - The number of strains sharing the minority allele.
  - A bit-pattern describing the SNP type, with 1 for the strains sharing the minority allele and 0 for the others. The strains are sorted in the same order as in the PHYLIP alignment file.
  - A list of all the strains sharing the minority allele.
- The columns_ file has one SNP per line, giving the position in the alignment plus the bit-pattern describing the SNP.
- The _histograms file contains, for each pair of strains, a histogram counting the number of 1 kb blocks with 0, 1, 2, etc. SNPs. Note that these counts come from 1 kilobase sliding windows along the core genome alignment, with the window sliding by 100 bases at a time, so an alignment column will typically occur in 10 blocks.
- The .pkl file is a pickle file with, for each pair, the results of the mixture modeling. Each line corresponds to a pair and has these fields: [spec1, spec2, div, Lpois, r_nomix, Lmix, rho, r, a, lam, mutpois, mutrecomb, cut], which correspond, in order, to:
  - spec1: The name of strain 1.
  - spec2: The name of strain 2.
  - div: Their overall nucleotide divergence.
  - Lpois: The log-likelihood under a model assuming SNP counts form a simple Poisson distribution.
  - r_nomix: The parameter of this fitted Poisson distribution.
  - Lmix: The log-likelihood of the mixture of a Poisson and a negative binomial distribution.
  - rho: The fraction assigned to the Poisson part of the mixture.
  - r: The parameter of the Poisson component.
  - a: The exponent of the negative binomial component.
  - lam: The second parameter (lambda) of the negative binomial component.
  - mutpois: The estimated total number of mutations in the Poisson component.
  - mutrecomb: The estimated total number of mutations in the negative binomial component.
  - cut: The value at which the likelihood of the Poisson component starts exceeding the likelihood of the negative binomial component.
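The sliding-window counting behind the _histograms file can be illustrated with a minimal sketch. This is not the code used to produce the dataset: the function name `window_snp_histogram`, the 0-based positions, and the choice to skip a final partial window are all assumptions made for illustration, and the handling of gaps or ambiguous columns is not covered.

```python
from collections import Counter


def window_snp_histogram(snp_positions, alignment_length, window=1000, step=100):
    """Count SNPs in sliding windows and return {snp_count: number_of_windows}.

    snp_positions: 0-based alignment columns at which a pair of strains differs
    (hypothetical input; the dataset's own coordinate convention may differ).
    """
    # Mark SNP columns, then build prefix sums so each window is O(1) to count.
    is_snp = [0] * alignment_length
    for pos in snp_positions:
        is_snp[pos] = 1
    prefix = [0]
    for flag in is_snp:
        prefix.append(prefix[-1] + flag)

    histogram = Counter()
    # Only full-length windows are counted (an assumption of this sketch).
    for start in range(0, alignment_length - window + 1, step):
        snps_in_window = prefix[start + window] - prefix[start]
        histogram[snps_in_window] += 1
    return dict(histogram)


# Toy example: a 2000-column alignment with three SNPs gives 11 windows.
example = window_snp_histogram([0, 150, 1500], 2000)
# example == {2: 1, 1: 6, 0: 4}
```

Because the step (100) is a tenth of the window size (1000), a column away from the alignment ends is covered by 10 windows, so the window counts in a histogram are not independent observations.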
In addition, for the human data we provide a PHYLIP multiple genome alignment and a file with all SNP columns.
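The bit-patterns used in the snp_stats and columns_ files can be mapped back to strain names as in the sketch below. It assumes the pattern is available as a plain string of 0 and 1 characters, one per strain, and that the strain names have already been read from the PHYLIP alignment in order; the function name and the toy strain names are hypothetical, and the exact field delimiters of the files are not assumed here.

```python
def minority_strains(bit_pattern, strain_names):
    """Return the strains flagged with '1' (minority allele) in a bit-pattern.

    bit_pattern: string such as "0101", one character per strain.
    strain_names: names in the same order as in the PHYLIP alignment file.
    """
    if len(bit_pattern) != len(strain_names):
        raise ValueError("bit-pattern length must match the number of strains")
    return [name for bit, name in zip(bit_pattern, strain_names) if bit == "1"]


# Toy example with four made-up strain names.
strains = ["strainA", "strainB", "strainC", "strainD"]
shared = minority_strains("0101", strains)
# shared == ["strainB", "strainD"]
```

The number of names returned should match the "number of strains sharing the minority allele" column of the corresponding snp_stats line, which gives a simple consistency check when parsing.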
Files
Name | Size |
---|---|
recombination.zip (md5:527ba8a330a9724a51d8008eff1052a4) | 32.8 GB |