Published August 6, 2024 | Version v1
Dataset Open

BacSPaD: A robust bacterial strains' pathogenicity resource based on integrated and curated genomic metadata

  • 1. Université Claude Bernard Lyon 1

Description

The vast array of omics data in microbiology presents significant opportunities for studying bacterial pathogenesis and creating computational tools for predicting pathogenic potential. However, the field lacks a comprehensive, curated resource that catalogs bacterial strains and their ability to cause human infections. Current methods for identifying pathogenicity determinants often introduce biases and miss critical aspects of bacterial pathogenesis.
In response to this gap, we introduce BacSPaD (Bacterial Strains’ Pathogenicity Database), a thoroughly curated database focusing on pathogenicity annotations for a wide range of high-quality, complete bacterial genomes. Our rule-based annotation workflow combines metadata from trusted sources with automated keyword matching, extensive manual curation, and detailed literature review. Our analysis classified 5,502 genomes as pathogenic to humans (HP) and 490 as non-pathogenic to humans (NHP), encompassing 532 species, 193 genera, and 96 families. Statistical analysis demonstrated a significant but moderate correlation between virulence factors and HP classification, highlighting the complexity of bacterial pathogenicity and the need for ongoing research. This resource is poised to enhance our understanding of bacterial pathogenicity mechanisms and aid in the development of predictive models. To improve accessibility and provide key visualization statistics, we developed a user-friendly web interface, accessible at https://bacspad.altrabio.com.
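For readers who plan to train predictive models on these labels, the class balance implied by the counts above is worth making explicit. The short sketch below only performs arithmetic on the two genome counts quoted in this description; the variable names are illustrative and not part of the dataset.

```python
# Genome counts quoted in the description above.
hp_genomes = 5502   # pathogenic to humans (HP)
nhp_genomes = 490   # non-pathogenic to humans (NHP)

total = hp_genomes + nhp_genomes
hp_share = hp_genomes / total
nhp_share = nhp_genomes / total
imbalance_ratio = hp_genomes / nhp_genomes

print(f"total labeled genomes: {total}")
print(f"HP share: {hp_share:.1%}, NHP share: {nhp_share:.1%}")
print(f"HP:NHP ratio: {imbalance_ratio:.1f}:1")
```

With roughly eleven HP genomes for every NHP genome, a model evaluated on plain accuracy could look strong while rarely predicting NHP, so class-aware metrics or resampling may be worth considering.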

Technical info

Metadata fields

Description

pathogenicity_label

Labeling according to pathogenicity - either non-pathogenic to humans (NHP) or pathogenic to humans (HP).

genome_id

Genome ID from Bacterial and Viral Bioinformatics Resource Center (BV-BRC) database.

genome_name

Genome name.

strain

Strain name according to National Center for Biotechnology Information (NCBI) taxonomy.

species

Species name according to NCBI taxonomy.

genus

Genus name according to NCBI taxonomy.

family

Family name according to NCBI taxonomy.

order

Order name according to NCBI taxonomy.

class

Class name according to NCBI taxonomy.

phylum

Phylum name according to NCBI taxonomy.

biosample_accession

BioSample accession number from NCBI.

taxon_id

Taxon ID from NCBI taxonomy.

serovar

Taxonomy below subspecies; a variant which is usually based on its antigenic properties. Same as serotype (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

biovar

Variant distinguished by its unique biochemical or physiological traits (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

pathovar

Taxonomy below subspecies; a variety usually based on its pathogenic properties. Sometimes used as equivalent to subspecies (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

mlst

Genotypic identifier based on housekeeping gene sequences.

other_typing

Strain typing or characterization methods beyond the standard approaches such as MLST (Multilocus Sequence Typing). Each "genotype" followed by a number (e.g., genotype:1 or genotype:1903) denotes a unique genetic profile or pattern that has been identified in the microbial species under investigation.

culture_collection

Reference to a deposited microbial strain in a repository, identified by a unique accession number.

type_strain

Indicates whether the genome comes from a type strain (‘yes’ or empty). A type strain is the nomenclatural standard for a particular bacterial species, serving as a reference point for its definition and identification.

completion_date

Date of project completion.

publication

Associated scientific publication identifier.

bioproject_accession

Unique identifier to corresponding project in NCBI.

assembly_accession

Unique identifier to corresponding genome assembly in NCBI. Refers to a specific version of a genome assembly submitted to a database like NCBI's GenBank. 

genbank_accessions

Unique identifier(s) of GenBank assembly/assemblies in NCBI. 

refseq_accessions

Unique identifiers assigned to sequences within the Reference Sequence (RefSeq) database. RefSeq sequences are curated by NCBI staff and collaborators.

sequencing_centers

Sequencing center (e.g. University ‘x’, Hospital ‘y’).

sequencing_platform

Sequencing platform (e.g. Illumina, PacBio).

sequencing_depth

Average number of times each nucleotide in a genome is sequenced. 

assembly_method

Methodology used to assemble the genomic sequences. 

chromosomes

Number of associated chromosomes.

plasmids

Number of associated plasmids.

contigs

Number of associated contigs.

genome_length

Genome length measured in base pairs (bp). 

gc_content

Measure of the proportion of guanine (G) and cytosine (C) nucleotides in the DNA sequence, expressed as a percentage of the total nucleotide composition.

patric_cds

Number of protein-coding sequences (CDS) annotated or sourced from PATRIC (previous version of BV-BRC).

refseq_cds

Number of protein-coding sequences (CDS) annotated or sourced from RefSeq database.

isolation_source

Corresponding origin of isolation. This attribute provides information about the ecological niche or source of the bacterial strain.

isolation_comments

Additional notes or comments regarding the isolation of a specific bacterial strain.

collection_date

Date on which a specific bacterial strain was collected or isolated from its source.

isolation_country

Country associated with the biological sample isolation. 

geographic_location

Geographical descriptors associated with the biological sample isolation.

other_environmental

Supplementary attribute to describe specific environmental conditions or contexts associated with the biological sample.

host_gender

Host gender.

host_age

Host age.

host_health

Host health status or condition. 

body_sample_site

Specific anatomical site or location from which the biological sample was collected.

other_clinical

Additional clinical information or metadata associated with the biological sample.

antimicrobial_resistance

Antimicrobial resistance phenotype for genomes that have been tested against specific antibiotics. Note that a genome can have multiple antibiotic phenotypes, such as being resistant to one drug and susceptible to another. Values in this field include ‘Resistant’, ‘Susceptible’, or ‘Intermediate’ (https://www.bv-brc.org/docs/quick_references/organisms_taxon/antimicrobial_resistance.html).

antimicrobial_resistance_evidence

Indicates the information source behind the AMR designation. Allowable values include “Computational Prediction”, “Computational Method”, and “AMR Panel” (https://www.bv-brc.org/docs/quick_references/organisms_taxon/antimicrobial_resistance.html).

gram_stain_bvbrc

Gram staining information (“positive” or “negative”) sourced from BV-BRC.

cell_shape

Cell shape information (e.g. Bacilli, Cocci).

motility

Motility information (“yes”: motile, “no”: non-motile).

temperature_range

Phenotype describing the temperature range in which the organism is known to thrive, survive, or exhibit optimal growth (e.g. ‘Mesophilic’).

optimal_temperature

Temperature at which the organism is known to exhibit optimal growth.

oxygen_requirement

Specific oxygen conditions a microorganism requires to survive. Values include ‘Aerobic’, ‘Anaerobic’, ‘Facultative’, or ‘Microaerophilic’.

habitat

Natural or artificial habitat in which the bacterium resides or was found.

disease

Host disease.

comments

Supplementary comments providing further contextual details.

additional_metadata

Supplementary metadata providing further contextual details.

env_broad_scale

Broad-scale environmental context (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

env_local_scale

Local-scale environmental context (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

env_medium

Environmental medium/material. Keywords describing the material displaced by the entity during sampling (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

isol_growth_condt

Description of, or URL pointing to, the isolation and growth condition specifications (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

project_name

A concise name that describes the overall project (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

pathogenicity_details

Additional information on the pathogenicity of the bacterial strain (e.g. ‘commensal’ or ‘diphtheria-like symptoms’).

host_disease

Name of relevant disease, e.g. Salmonella gastroenteritis (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

host_health_state

Information regarding health state of the individual sampled at the time of sampling (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

host_disease_outcome

Final outcome of disease, e.g., death, chronic disease, recovery.

host_description

Additional host information not included in other defined vocabulary fields (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

host_disease_stage

Stage of disease at the time of sampling (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

pathotype

Bacteria-specific pathotype, e.g. STEC or UPEC for Escherichia coli (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

subsource_note

Subsource note. Further details about the origin, isolation method, or other relevant information regarding the sample used.

note

Additional note. This can include details about the source of the sequence, experimental conditions, characteristics of the organism, or any other relevant information.

description

Further details on isolation source or organism.

biotic_relationship

Observed biotic relationship. Values are empty, ‘free living’, ‘parasite’, ‘commensal’, or ‘symbiont’ (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

biome

Major environment type(s) where sample was collected (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/).

host_status

Information on host health status.

risk_group

Risk group classification based on the potential hazard to human health and the environment (assigned at species level; ranges from 1 to 3, with 3 representing the highest hazard).

Note on infection mode

Further details on infection mode.

checkm_compl_final

Genome completeness (%) according to CheckM tool v1.1.6. 

checkm_contam_final

Genome contamination (%) according to CheckM tool v1.1.6. 

disease_category

Disease category (e.g. “Respiratory diseases”). 

disease_subcategory

Subcategory of the main disease category (e.g. “Pneumonia”). When the specific infectious disease name is not available, an associated keyword is given instead (e.g. “Pertussis”).

isolation_source_category

Isolation source category (e.g. “Respiratory tract”). 

disease_comb

Combination of the disease category and disease subcategory (e.g. Respiratory Diseases - Pneumonia).

Files

Genomes_labeled.csv
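
As a starting point for working with this file, here is a minimal sketch using Python's standard csv module. It assumes Genomes_labeled.csv is a comma-separated file with a header row whose column names match the metadata fields listed above, which this record does not state explicitly. The three sample rows are invented for illustration, and the quality thresholds (completeness of at least 95%, contamination of at most 5%) are hypothetical choices, not values given by the authors.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for Genomes_labeled.csv: the column names come from
# the metadata field list, but every value below is invented.
SAMPLE_CSV = """genome_id,species,pathogenicity_label,checkm_compl_final,checkm_contam_final
0000.1,Example species A,HP,99.8,0.2
0000.2,Example species B,NHP,97.5,1.1
0000.3,Example species A,HP,,
"""


def load_rows(handle):
    """Read CSV rows as dictionaries keyed by the header names."""
    return list(csv.DictReader(handle))


def to_float(value):
    """Convert a CSV cell to float, returning None for empty cells."""
    value = (value or "").strip()
    return float(value) if value else None


def passes_quality(row, min_completeness=95.0, max_contamination=5.0):
    """Keep rows whose CheckM values are present and within the thresholds.

    The default thresholds are illustrative assumptions.
    """
    completeness = to_float(row.get("checkm_compl_final"))
    contamination = to_float(row.get("checkm_contam_final"))
    if completeness is None or contamination is None:
        return False
    return completeness >= min_completeness and contamination <= max_contamination


rows = load_rows(io.StringIO(SAMPLE_CSV))
label_counts = Counter(row["pathogenicity_label"] for row in rows)
high_quality = [row["genome_id"] for row in rows if passes_quality(row)]

print(dict(label_counts))
print(high_quality)
```

To run the same steps on a downloaded copy, the `io.StringIO(SAMPLE_CSV)` argument could be replaced with `open("Genomes_labeled.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption.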

Files (6.9 MB)

md5:deabb2ceb23c64fc785f3dafd6a3a4c5 (5.2 MB)
md5:dbf1fdd06d9d4ff8cf70ef5a66bd3feb (1.7 MB)
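
The md5 values listed above can be used to confirm that a download is intact. The sketch below uses Python's hashlib; the helper names are illustrative, the local file path is whatever the reader chose when downloading, and this record does not state which checksum belongs to which file, so the expected value has to be matched by the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file's MD5 with a value written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("Genomes_labeled.csv", "md5:deabb2ceb23c64fc785f3dafd6a3a4c5")` returns True only if the local file hashes to that value. MD5 is suitable here as an integrity check against corrupted downloads, not as a security guarantee.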

Additional details

Related works

Is described by
Publication: 10.20944/preprints202407.0837.v1 (DOI)

Funding

European Commission
PEST-BIN - Pioneering Strategies Against Bacterial Infections (grant 955626)

Dates

Available
2024-07-12

References

  • Ribeiro, S., Chaumet, G., Alves, K., Nourikyan, J., Shi, L., Lavergne, J.-P., Mijakovic, I., de Bernard, S., & Buffat, L. (2024). BacSPaD: A robust bacterial strains' pathogenicity resource based on integrated and curated genomic metadata. Preprints, 202407.0837.v1. https://doi.org/10.20944/preprints202407.0837.v1