Loci ID correspondence and loci subsets for the schemas deposited in Chewie-NS
Description
Dataset contents
This dataset includes files with the loci subsets, such as core loci subsets for cgMLST analysis, and the correspondence between the legacy loci IDs used by the discontinued Chewie-NS instance and the loci IDs used by the latest Chewie-NS instance. Each ZIP archive includes files for the schemas of a specific species. The contents of each ZIP archive are the following:
- species1_Spyogenes.zip -- contains the files for the schemas of the species with ID=1 (Streptococcus pyogenes).
  - species1_Spyogenes_schema1 -- contains the files for the schema with ID=1.
    - species1_Spyogenes_schema1_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs used by the current instance of Chewie-NS and the original loci IDs.
    - species1_Spyogenes_schema1_cgMLST95_loci_IDs.txt -- contains the list of loci IDs used by the current instance of Chewie-NS for the core loci defined based on a loci presence threshold of 95%.
    - species1_Spyogenes_schema1_cgMLST95_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs for the core loci, defined based on a loci presence threshold of 95%, used by the current instance of Chewie-NS and the original loci IDs.
    - species1_Spyogenes_schema1_cgMLST99_loci_IDs.txt -- contains the list of loci IDs used by the current instance of Chewie-NS for the core loci defined based on a loci presence threshold of 99%.
    - species1_Spyogenes_schema1_cgMLST99_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs for the core loci, defined based on a loci presence threshold of 99%, used by the current instance of Chewie-NS and the original loci IDs.
    - species1_Spyogenes_schema1_cgMLST100_loci_IDs.txt -- contains the list of loci IDs used by the current instance of Chewie-NS for the core loci defined based on a loci presence threshold of 100%.
    - species1_Spyogenes_schema1_cgMLST100_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs for the core loci, defined based on a loci presence threshold of 100%, used by the current instance of Chewie-NS and the original loci IDs.
    - species1_Spyogenes_schema1_Transcriptional_Regulators_loci_IDs.txt -- contains the list of loci IDs used by the current instance of Chewie-NS for a set of transcriptional regulators.
    - species1_Spyogenes_schema1_Transcriptional_Regulators_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs for the transcriptional regulators used by the current instance of Chewie-NS and the original loci IDs.
    - species1_Spyogenes_schema1_Virulence_Factors_loci_IDs.txt -- contains the list of loci IDs used by the current instance of Chewie-NS for a set of virulence factors.
    - species1_Spyogenes_schema1_Virulence_Factors_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs for the virulence factors used by the current instance of Chewie-NS and the original loci IDs.
- species10_Ecoli.zip -- contains the files for the schemas of the species with ID=10 (Escherichia coli).
  - species10_Ecoli_schema1 -- contains the files for the schema with ID=1 (more information about the schema creation process and the definition of the loci subsets is available here).
    - species10_Ecoli_schema1_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs used by the current instance of Chewie-NS, the original loci IDs, and the loci IDs used in the first instance of Chewie-NS (discontinued in July 2025).
    - species10_Ecoli_schema1_cgMLST99_loci_IDs.txt -- contains the list of loci IDs used by the current instance of Chewie-NS for the core loci defined based on a loci presence threshold of 99%.
    - species10_Ecoli_schema1_cgMLST99_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs for the core loci, defined based on a loci presence threshold of 99% as described here, used by the current instance of Chewie-NS, the original loci IDs, and the loci IDs used in the first instance of Chewie-NS (discontinued in July 2025).
- species14_Senterica.zip -- contains the files for the schemas of the species with ID=14 (Salmonella enterica).
  - species14_Senterica_schema1 -- contains the files for the schema with ID=1 (more information about the schema creation process and the definition of the loci subsets is available here).
    - species14_Senterica_schema1_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs used by the current instance of Chewie-NS, the original loci IDs, and the loci IDs used in the first instance of Chewie-NS (discontinued in July 2025).
    - species14_Senterica_schema1_cgMLST99_loci_IDs.txt -- contains the list of loci IDs used by the current instance of Chewie-NS for the core loci defined based on a loci presence threshold of 99%.
    - species14_Senterica_schema1_cgMLST99_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs for the core loci, defined based on a loci presence threshold of 99% as described here, used by the current instance of Chewie-NS, the original loci IDs, and the loci IDs used in the first instance of Chewie-NS (discontinued in July 2025).
- species18_Lmonocytogenes.zip -- contains the files for the schemas of the species with ID=18 (Listeria monocytogenes).
  - species18_Lmonocytogenes_schema1 -- contains the files for the schema with ID=1 (corresponding to the Institut Pasteur Listeria monocytogenes cgMLST schema described in Moura et al., 2016, available at https://bigsdb.pasteur.fr/listeria/).
    - species18_Lmonocytogenes_schema1_loci_IDs_mapping.tsv -- contains the correspondence between the loci IDs for the core loci used by the current instance of Chewie-NS, the original loci IDs, and the loci IDs used in the first instance of Chewie-NS (discontinued in July 2025).
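The loci list files described above can be used to restrict a table of allelic profiles to a loci subset, for example the cgMLST99 core loci. The following is a minimal sketch, not part of the dataset, that assumes the list file holds one loci ID per line and that the profiles are tab-separated with the sample identifier in the first column and loci IDs in the header. The function names `read_loci_ids` and `subset_profiles` are hypothetical, and the loci IDs in the demo are illustrative only.

```python
import csv
import io


def read_loci_ids(handle):
    """Return the loci IDs in a list file (assumed: one ID per line)."""
    return [line.strip() for line in handle if line.strip()]


def subset_profiles(profiles, loci_ids, out):
    """Copy the first column plus every column whose header is in loci_ids.

    Returns the number of loci columns that were kept.
    """
    wanted = set(loci_ids)
    reader = csv.reader(profiles, delimiter="\t")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = next(reader)
    # Column 0 holds the sample identifier and is always kept.
    keep = [0] + [i for i, name in enumerate(header) if i > 0 and name in wanted]
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])
    return len(keep) - 1


if __name__ == "__main__":
    # In-memory stand-ins for a loci list file and a results_alleles.tsv file.
    loci_list = io.StringIO("wgMLST-00027274\nwgMLST-00027276\n")
    profiles = io.StringIO(
        "FILE\twgMLST-00027274\twgMLST-00027275\twgMLST-00027276\n"
        "Genome1\t1\t2\t1\n"
        "Genome2\t2\t2\t2\n"
    )
    subset = io.StringIO()
    kept = subset_profiles(profiles, read_loci_ids(loci_list), subset)
    print(f"kept {kept} loci columns")
    print(subset.getvalue(), end="")
```

With real files, the same functions can be called on handles opened with `open(path, newline="")`. The loci IDs in the list file and in the profiles header must come from the same Chewie-NS instance for any column to match.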
Converting legacy loci IDs to the latest loci IDs
Results files generated with schemas downloaded from the discontinued Chewie-NS instance contain legacy loci IDs. The convert_ids.py Python script included in this dataset converts those legacy loci IDs to the loci IDs used by the latest Chewie-NS instance. The script accepts a single results file (e.g., files generated by chewBBACA's AlleleCall module, such as the results_alleles.tsv or loci_summary_stats.tsv files) and a TSV file with the loci ID correspondence. The following files contain the loci ID correspondence between the legacy and latest schemas for each species:
- Escherichia coli (species10_Ecoli_schema1_loci_IDs_mapping.tsv)
- Salmonella enterica (species14_Senterica_schema1_loci_IDs_mapping.tsv)
- Listeria monocytogenes (species18_Lmonocytogenes_schema1_loci_IDs_mapping.tsv)
As an example, consider a results file for E. coli, such as the results_alleles.tsv file containing allelic profiles, with the following contents:
| FILE | INNUENDO_wgMLST-00016024 | INNUENDO_wgMLST-00016025 | INNUENDO_wgMLST-00016026 |
|---|---|---|---|
| Genome1 | 1 | 2 | 1 |
| Genome2 | 2 | 2 | 2 |
| Genome3 | 1 | 1 | 1 |
To convert the legacy loci IDs in this file, run the following command:
python convert_ids.py -i results_alleles.tsv -it species10_Ecoli_schema1_loci_IDs_mapping.tsv
The script replaces all legacy loci IDs with the loci IDs used by the latest instance of Chewie-NS, resulting in the following file contents:
| FILE | wgMLST-00027274 | wgMLST-00027275 | wgMLST-00027276 |
|---|---|---|---|
| Genome1 | 1 | 2 | 1 |
| Genome2 | 2 | 2 | 2 |
| Genome3 | 1 | 1 | 1 |
The script can be used to convert loci IDs in any file that includes legacy loci IDs. It is also possible to convert back to the legacy loci IDs by providing the --invert option. To view the full usage instructions for the script, run the following command:
python convert_ids.py -h
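For readers who want to see the idea behind the conversion, the following is a minimal sketch of an ID substitution driven by a mapping TSV. It is not the code of convert_ids.py and makes its own assumptions: the mapping file is tab-separated, the positions of the legacy and latest ID columns are passed explicitly (`src_col`, `dst_col`) because the column order of the mapping files is not stated here, and only cells that match a source ID exactly are replaced. The function names `load_mapping` and `replace_ids` are hypothetical.

```python
import csv
import io


def load_mapping(handle, src_col, dst_col, has_header=False):
    """Build a {source ID: target ID} dict from a tab-separated mapping file.

    src_col and dst_col are zero-based column indices (assumed layout).
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)
    needed = max(src_col, dst_col)
    return {row[src_col]: row[dst_col] for row in reader if len(row) > needed}


def replace_ids(results, mapping, out):
    """Rewrite a tab-separated file, replacing every cell found in mapping."""
    reader = csv.reader(results, delimiter="\t")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow([mapping.get(cell, cell) for cell in row])


if __name__ == "__main__":
    # Stand-in mapping: latest ID in column 0, legacy ID in column 1 (assumed).
    mapping_file = io.StringIO(
        "wgMLST-00027274\tINNUENDO_wgMLST-00016024\n"
        "wgMLST-00027275\tINNUENDO_wgMLST-00016025\n"
    )
    legacy_to_latest = load_mapping(mapping_file, src_col=1, dst_col=0)
    results = io.StringIO(
        "FILE\tINNUENDO_wgMLST-00016024\tINNUENDO_wgMLST-00016025\n"
        "Genome1\t1\t2\n"
    )
    converted = io.StringIO()
    replace_ids(results, legacy_to_latest, converted)
    print(converted.getvalue(), end="")
```

Swapping `src_col` and `dst_col` when loading the mapping gives the reverse direction, which is the same idea as the script's --invert option. Because every cell is checked, the sketch also covers files that list loci IDs in rows rather than in the header.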
Files
Total size: 261.0 kB

| Checksum | Size |
|---|---|
| md5:73ce0f140d8002dd9248fda66d4d63a9 | 2.5 kB |
| md5:541b82d853b9e82c7b3a4db618fe97cd | 89.9 kB |
| md5:6009b1678acb6eba41a1b0b4b451c3fb | 105.2 kB |
| md5:dbefa35d7d5bf653e72110ba3d95aecf | 14.5 kB |
| md5:6ceb942af30ea9da74bc32ab7623f7ea | 48.9 kB |
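The md5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using Python's standard hashlib module. The function names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the pairing of each checksum with a file name has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:73ce0f14...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(path, listed)` returns True when `path` points to a downloaded file and `listed` is the checksum string shown for that same file.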