Published June 19, 2018 | Version 1.0.2
Dataset Open

sistr_cmd v1.0.2 serotyping databases

  • 1. Public Health Agency of Canada

Description

Salmonella In Silico Typing Resource (SISTR) sistr_cmd version 1.0.2 serotyping databases

File structure tree for sistr_cmd data folder:

.
|-- [4.0K]  antigens
|   |-- [1.0M]  fliC.fasta
|   |-- [210K]  fljB.fasta
|   |-- [126K]  wzx.fasta
|   `-- [ 60K]  wzy.fasta
|-- [4.0K]  cgmlst
|   |-- [7.4M]  cgmlst-centroid.fasta
|   |-- [ 96M]  cgmlst-full.fasta
|   |-- [134M]  cgmlst-profiles.hdf
|   `-- [ 803]  README.md
|-- [1.1M]  genomes-to-serovar.txt
|-- [1.0M]  genomes-to-subspecies.txt
|-- [118K]  Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv
`-- [ 92M]  sistr.msh

2 directories, 12 files

Description of files:

  • genomes-to-serovar.txt: Tab-delimited mapping of each genome ID to its serovar designation for the 52,790 Salmonella genomes.
  • genomes-to-subspecies.txt: Tab-delimited mapping of each genome ID to its subspecies designation for the 52,790 Salmonella genomes.
  • Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv: Serovar and antigenic formula information table used by `sistr_cmd` for looking up serovar designations from antigen results
  • sistr.msh: Mash sketch file of 11,840 Salmonella genomes for Mash-based serotyping
  • antigens: for antigen gene search-based serotyping
    • fliC.fasta: fliC gene alleles for H1-antigen typing
    • fljB.fasta: fljB gene alleles for H2-antigen typing
    • wzx.fasta: wzx gene alleles for O-antigen typing
    • wzy.fasta: wzy gene alleles for O-antigen typing
  • cgmlst: for core-genome multilocus sequence typing (cgMLST) and cgMLST-based serotyping
    • cgmlst-profiles.hdf: HDF5 file with cgMLST allelic profiles of 52,790 Salmonella genomes
      • read in with Pandas, e.g.
        pd.read_hdf(CGMLST_PROFILES_PATH, key='cgmlst')
    • cgmlst-centroid.fasta: "Centroid" (representative) alleles from the 52,790 Salmonella genomes for rapid NCBI BLAST+ blastn searching. For each locus, centroid alleles were selected from the full set of alleles as follows:
      • group alleles by length
      • subgroup the length-grouped alleles by their ends (28 bp at the allele start and end; 28 is the default word size of blastn megablast)
      • hierarchically cluster the alleles within each length+end group
      • cut the clustering into flat clusters at 2.5% distance
      • within each cluster, pick the allele with the least distance to the others in the cluster
    • cgmlst-full.fasta: full set of alleles for the 52,790 Salmonella genomes
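
The centroid allele selection steps listed for cgmlst-centroid.fasta can be illustrated with a small pure-Python sketch. This is not the code used to build the database. The distance metric (fraction of mismatched positions), the complete-linkage merge rule, and all function names are assumptions made for illustration; only the 28 bp end length and the 2.5% cutoff come from the description above.

```python
from collections import defaultdict
from itertools import combinations

END_LEN = 28      # from the description: 28 bp at allele start and end
MAX_DIST = 0.025  # from the description: flat clusters at 2.5% distance


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length alleles (assumed metric)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def complete_linkage_clusters(seqs, max_dist):
    """Naive agglomerative clustering (assumed complete linkage).

    Repeatedly merges the two clusters whose farthest pair of members is
    closest, and stops once no pair of clusters is within max_dist.
    """
    clusters = [[s] for s in seqs]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = max(hamming_fraction(a, b) for a in clusters[i] for b in clusters[j])
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            break
        _, i, j = best  # j > i, so popping j leaves index i valid
        clusters[i].extend(clusters.pop(j))
    return clusters


def medoid(cluster):
    """Allele with the smallest summed distance to the other cluster members."""
    return min(cluster, key=lambda s: sum(hamming_fraction(s, t) for t in cluster))


def centroid_alleles(alleles):
    """Pick representative alleles for one locus."""
    # Grouping by length and then by ends is equivalent to one combined key.
    groups = defaultdict(list)
    for seq in set(alleles):
        groups[(len(seq), seq[:END_LEN], seq[-END_LEN:])].append(seq)
    centroids = []
    for _, seqs in sorted(groups.items()):
        for cluster in complete_linkage_clusters(sorted(seqs), MAX_DIST):
            centroids.append(medoid(cluster))
    return centroids
```

Because alleles are grouped by length before clustering, a simple position-wise mismatch fraction is enough for this sketch and no alignment is needed. The naive clustering loop is only practical for small groups; a real build over tens of thousands of genomes would need an optimized clustering library.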

Files (348.7 MB)

Checksum                              Size
md5:3459c4cb1d459d4670cef246b497914f  7.7 MB
md5:86f28499099b3ec10525ffe5ae287012  100.7 MB
md5:073ae146e9f729cbea59d27e6639024a  140.2 MB
md5:69854c38bf25873afc0bf48e26b1eda4  1.1 MB
md5:aea4912a7bfd01c1117cecc16a5170ed  214.5 kB
md5:1f262c4c2c7ed9cfdc8bda9b010f3279  1.2 MB
md5:1e03cded94ea74c6910fde53914edd73  1.1 MB
md5:4243da7ec8ab7bb2ab43860433c43603  120.4 kB
md5:eaab468877783b83346efa11202d84fe  96.2 MB
md5:89172f9516eadc7cbb8538a2fd1be6f9  129.4 kB
md5:001792c2cb15dd0ea40114539309854e  61.0 kB
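
Downloaded files can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_md5` are hypothetical helpers, not part of `sistr_cmd`.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:<hex digest>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

To use it, pass the local path of a downloaded file together with the checksum string listed for that file; a result of `False` suggests an incomplete or corrupted download.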

Additional details