AMPSphere : the worldwide survey of prokaryotic antimicrobial peptides

Santos-Júnior, Célio Dias; Duan, Yiqian; Chong, Hui; Schmidt, Thomas S.B.; Fullam, Anthony; Bork, Peer; Zhao, Xing-Ming; Coelho, Luis Pedro

doi:10.5281/zenodo.6511404

Published May 2, 2022 | Version v.2022-03



1. Institute of Science and Technology for Brain-Inspired Intelligence - ISTBI, Fudan University, Shanghai, China
2. Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany

AMPSphere v.2022-03: the worldwide survey of prokaryotic antimicrobial peptides

INTRODUCTION

AMPSphere is a comprehensive catalog of antimicrobial peptides predicted using Macrel (DOI: 10.7717/peerj.10555) from 63,410 public metagenomes, the ProGenomes v2.2 database (82,400 high-quality microbial genomes), and ca. 4,000 non-whitelisted microbial genomes from NCBI. AMPSphere is currently available as a web resource at https://ampsphere.big-data-biology.org/.

GENERATION

Peptides were predicted using Macrel. Singleton peptides were removed, except those with a direct hit to the DRAMP database. Redundant peptides were coded using a reduced alphabet and hierarchically clustered using CD-HIT (version 4.6) at 100%, 85%, and 75% amino acid identity (with 90% overlap of the shorter peptide). The resulting clusters were numbered by decreasing size (number of peptides). Each level of clustering is called a SPHERE. Redundant nucleotide sequences for the gene variants of different AMPs are also included in this version of AMPSphere.

STATISTICS

AMPSphere v.2022-03 contains 863,498 sequences (average length: 36 amino acids; range: 8-98). The DRAMP database was used to identify previously confirmed sequences by strict homology to its reference entries. This approach showed that 2,488 peptides in our dataset were previously confirmed.

IDENTIFIERS

Peptides are named in the form >AMP10.XXX_XXX, where XXX_XXX is a unique numerical identifier (starting at zero). Numbers were assigned in order of decreasing number of copies, so that the lower the number, the more copies of that peptide were present in the input data. Annotations are also provided as separate fields in the FASTA headers, giving each peptide's:

- SPHERE families at level III (corresponding to hierarchically obtained clusters using 100-85-75% of identity with a minimum overlap of 90% of the shorter gene).

Example of header:

>AMP10.000_000 | SPHERE-III.001_493
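A header in this form can be split into its two fields with a few lines of Python. This is a minimal sketch based only on the example header above; the names `AmpHeader` and `parse_header` are illustrative and not part of any AMPSphere tool, and the sketch assumes every header has exactly two fields separated by `|`.

```python
from typing import NamedTuple


class AmpHeader(NamedTuple):
    accession: str   # e.g. "AMP10.000_000"
    sphere_iii: str  # e.g. "SPHERE-III.001_493"


def parse_header(line: str) -> AmpHeader:
    """Split a header such as '>AMP10.000_000 | SPHERE-III.001_493'."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    # Assumption: exactly two '|'-separated fields, as in the README example.
    fields = [field.strip() for field in line[1:].split("|")]
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
    return AmpHeader(*fields)


header = parse_header(">AMP10.000_000 | SPHERE-III.001_493")
print(header.accession, header.sphere_iii)
```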

VERSION DETAILS

Version 2022-03 includes:

- quality assessment of documented AMPs,

- metadata associated with the genes,

- a better taxonomic identification of AMP sources using GTDB.

WARNING: Because the AMP sorting procedure changed, some entries and families may have different accessions than in previous versions.

FILES WITHIN THIS VERSION

README.md
This file.

AMPSphere_v.2022-03.fna.xz
Multi-fasta with AMPSphere gene sequences (nucleotide).

AMPSphere_v.2022-03.faa.gz

Multi-fasta with AMPSphere peptide sequences (amino acid).

SPHERE_v.2022-03.levels_assessment.tsv.gz

TSV table relating AMP name and the hierarchically obtained clusters per level. Columns:

- AMP accession
- evaluation vs. representative
- SPHERE_fam level I
- SPHERE_fam level II
- SPHERE_fam level III

Levels of each SPHERE family:

I: contains clusters obtained with 100% of identity cut-off and 90% of overlap of the shorter sequence;

II: contains clusters obtained with the unclustered sequences and the representatives from level I at 85% of identity and 90% of overlap of the shorter sequence;

III: contains clusters obtained with the unclustered sequences and the representatives from level II at 75% of identity and 90% of overlap of the shorter sequence;

`evaluation vs. representative` shows the percent identity of the sequence in an alignment against the cluster representative, and also the overlap in percent.

Example:

* -- This means: this sequence is a cluster representative.

OR something like this:

    77.50%,1:40:1:40 -- This means: alignment identity against the
                        representative of the cluster equals 77.5% and the
                        alignment start and end position for the query (1 and
                        40, respectively), and target (1 and 40, respectively).
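The two forms shown above can be parsed as follows. This is a sketch that assumes the column holds either `*` or a value shaped exactly like `77.50%,1:40:1:40`; the names `Evaluation` and `parse_evaluation` are illustrative only.

```python
from typing import NamedTuple, Optional


class Evaluation(NamedTuple):
    identity: float    # percent identity against the cluster representative
    query_start: int
    query_end: int
    target_start: int
    target_end: int


def parse_evaluation(value: str) -> Optional[Evaluation]:
    """Return None for a cluster representative ('*'), else the parsed alignment."""
    value = value.strip()
    if value == "*":
        return None
    # Assumption: "<identity>%,<q_start>:<q_end>:<t_start>:<t_end>"
    identity, coords = value.split(",")
    q_start, q_end, t_start, t_end = (int(x) for x in coords.split(":"))
    return Evaluation(float(identity.rstrip("%")), q_start, q_end, t_start, t_end)


print(parse_evaluation("*"))
print(parse_evaluation("77.50%,1:40:1:40"))
```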

AMPSphere_v.2022-03.quality_assessment.tsv.gz
TSV table containing the results of each quality test (by sequence). Columns:

- AMP ID
- Antifam
- RNAcode
- Metaproteomes
- Metatranscriptomes
- Coordinates

Results are one of 'Passed', 'Failed', or 'Not tested'.
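As an illustration of how such a table might be consumed, the sketch below tallies the three result values per AMP and lists the AMPs with no failed test. The rows and accessions in `SAMPLE` are made up for the example, and the sketch assumes the file starts with a header row and keeps the AMP ID in the first column; the real data would be read from the gzipped TSV (for instance with `gzip.open(path, "rt")`).

```python
import csv
import io

# Hypothetical rows for illustration only; they are not taken from the dataset.
SAMPLE = (
    "AMP\tAntifam\tRNAcode\tMetaproteomes\tMetatranscriptomes\tCoordinates\n"
    "AMP10.000_000\tPassed\tPassed\tPassed\tPassed\tPassed\n"
    "AMP10.000_001\tPassed\tNot tested\tFailed\tPassed\tPassed\n"
    "AMP10.000_002\tFailed\tNot tested\tFailed\tFailed\tPassed\n"
)


def summarize(handle):
    """Map each AMP ID to counts of its 'Passed'/'Failed'/'Not tested' results."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row (assumed to be present)
    summary = {}
    for amp_id, *results in reader:
        counts = {"Passed": 0, "Failed": 0, "Not tested": 0}
        for result in results:
            counts[result] += 1
        summary[amp_id] = counts
    return summary


summary = summarize(io.StringIO(SAMPLE))
no_failures = [amp for amp, counts in summary.items() if counts["Failed"] == 0]
print(no_failures)
```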

Antifam results show whether the sequence matches ('Failed') or does not match ('Passed') Antifam, a set of well-known spurious ORFs.

RNAcode relies on gene diversity; therefore, families with fewer than 3 different gene sequences could not be tested and were marked as 'Not tested'.

A peptide passed the metatranscriptomes test if 50% of its sequence directly matched transcripts in at least 2 different samples, and passed the metaproteomes test if 50% of its sequence directly matched peptides from meta-omics studies sampled from different environments.

Finally, the coordinates test checks whether the start of the small ORF is preceded by at least one upstream stop codon; this ensures that the gene is not a fragment of a larger protein.
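The idea behind the coordinates test can be illustrated with a short sketch. It is not the procedure used to build AMPSphere: the function name `has_upstream_stop`, the in-frame scan, the forward-strand coordinates, and the standard stop codons TAA/TAG/TGA are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons


def has_upstream_stop(contig: str, orf_start: int) -> bool:
    """Return True if an in-frame stop codon lies upstream of the ORF start.

    `orf_start` is the 0-based index of the first base of the start codon
    on the forward strand of `contig`.
    """
    contig = contig.upper()
    pos = orf_start - 3
    while pos >= 0:
        if contig[pos:pos + 3] in STOP_CODONS:
            return True
        pos -= 3
    # The contig edge was reached without a stop codon, so the ORF could be
    # a fragment of a larger gene that runs off the contig.
    return False


# TAA at positions 0-2, start codon ATG at position 6.
print(has_upstream_stop("TAAGGGATGAAATGA", 6))
# No stop codon before the ATG at position 3.
print(has_upstream_stop("GGGATGAAATGA", 3))
```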

AMPSphere_v.2022-03.general_geneinfo.tsv.gz
TSV table relating AMP, gene name, the microbial source, sample, environment, and geographical location. Columns:

- gmsc (gene code access)
- amp
- sample (biosample)
- source (microbial origin, GTDB taxonomy)
- specI (species cluster according to ProGenomes v.2 classification)
- is_metagenomic (False if coming from a high-quality microbial genome)
- geographic_location
- latitude
- longitude
- general_envo_name
- environment_material
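A table with these columns can be streamed row by row, which matters for a file of this size. The sketch below counts distinct AMPs per `general_envo_name` among metagenomic genes. It assumes the header row uses the column names listed above and that `is_metagenomic` is stored as the text `True` or `False`; every row in `ROWS` is invented for the example.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["gmsc", "amp", "sample", "source", "specI", "is_metagenomic",
           "geographic_location", "latitude", "longitude",
           "general_envo_name", "environment_material"]

# Hypothetical rows for illustration only; they are not taken from the dataset.
ROWS = [
    ["gene_1", "AMP10.000_000", "sample_A", "taxon_X", "", "True",
     "place_1", "0.0", "0.0", "human gut", "feces"],
    ["gene_2", "AMP10.000_000", "sample_B", "taxon_X", "", "True",
     "place_2", "1.0", "1.0", "human gut", "feces"],
    ["gene_3", "AMP10.000_001", "sample_C", "taxon_Y", "", "True",
     "place_3", "2.0", "2.0", "soil", "soil"],
    ["gene_4", "AMP10.000_001", "", "taxon_Y", "cluster_1", "False",
     "", "", "", "", ""],
]
SAMPLE = "\n".join("\t".join(row) for row in [COLUMNS] + ROWS) + "\n"


def amps_per_environment(handle):
    """Count distinct AMPs per general_envo_name among metagenomic genes."""
    amps = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["is_metagenomic"] == "True":
            amps[row["general_envo_name"]].add(row["amp"])
    return {env: len(found) for env, found in amps.items()}


counts = amps_per_environment(io.StringIO(SAMPLE))
print(counts)
```

For the real file, the same function could be given a handle opened with `gzip.open(path, "rt")` instead of the in-memory sample.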

CONTACT

You can contact us via our discussion group: https://groups.google.com/g/ampsphere-users

AMPsphere main developers:

- Célio Dias Santos Júnior
- Yiqian Duan
- Hui Chong
- Luis Pedro Coelho

COPYRIGHT NOTICE

AMPSphere v.2022-03 - the worldwide survey of prokaryotic antimicrobial peptides.

This work is a joint effort of the Big Data Biology group from the Institute of Science and Technology for Brain-Inspired Intelligence (ISTBI) - Fudan University, Shanghai, China, and the Structural and Computational Biology Unit (Heidelberg) - European Molecular Biology Laboratory (EMBL).

   AMPSphere IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
   OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
   IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
   DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
   OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
   USE OR OTHER DEALINGS IN THE SOFTWARE.

This database is free; you can redistribute it and/or modify it
as you wish, under the terms of the CC BY 4.0 license.

You are allowed to:

Share — copy and redistribute the material in any medium or format

Adapt — remix, transform, and build upon the material for any purpose,
even commercially.

You may also obtain a copy of the CC BY 4.0 license here:

https://creativecommons.org/licenses/by/4.0/

REFERENCES CITED

- Macrel: Santos-Júnior CD, Pan S, Zhao X, Coelho LP. 2020. Macrel: antimicrobial peptide screening in genomes and metagenomes. PeerJ 8:e10555. https://doi.org/10.7717/peerj.10555

- ProGenomes: Mende DR, Letunic I, Maistrenko OM et al. 2020. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Research 48(D1): D621–D625. https://doi.org/10.1093/nar/gkz1002

- DRAMP: Kang X, Dong F, Shi C et al. 2019. DRAMP 2.0, an updated data repository of antimicrobial peptides. Sci Data 6, 148. https://doi.org/10.1038/s41597-019-0154-y

- ANTIFAM: Eberhardt RY, Haft DH, Punta M, Martin M, O’Donovan C, Bateman A. 2012. AntiFam: a tool to help identify spurious ORFs in protein annotation. Database, Bas003.

- RNAcode: Washietl S, Findeiss S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, Goldman N. 2011. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17(4):578-94.

Notes

This work was supported by the National Key R&D Program of China (2020YFA0712403, 2018YFC0910500), the National Natural Science Foundation of China (61932008, 61772368), the Shanghai Science and Technology Innovation Fund (19511101404), and the Shanghai Municipal Science and Technology Major Project (2018SHZDZX01). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the dataset.

Files (273.8 MB)

Name	MD5	Size
AMPSphere_v.2021-03_families_tree_nwk.tar.gz	69e958a4e1c05fded6c20bc08a7c18e0	2.5 MB
AMPSphere_v.2022-03.faa.gz	783e51d99217cd45f014f7de6c166f3e	24.3 MB
AMPSphere_v.2022-03.fna.xz	22f7208f1800ce2545da8643354d8cc6	96.3 MB
AMPSphere_v.2022-03.general_geneinfo.tsv.gz	821acfead28ec785023c0495fac64a17	115.7 MB
AMPSphere_v.2022-03.quality_assessment.tsv.gz	1843424f80864fb2ab6575dd6b3252e2	2.7 MB
DRAMP_anno_AMPSphere_v.2021-03.parsed.tsv.gz	bea7d1cb04fe2a57c40ecc87ffd3b86c	296.4 kB
README.md	3900512937f6cf353c10129fe4e994e2	8.1 kB
SPHERE_v.2021-03.levels_assessment.tsv.gz	030cc43a8b1d6defb8070344f445dc04	19.2 MB
SPHERE_v.2022-03.levels_assessment.tsv.gz	a1c45d76db598fa861f8725380f87c74	12.9 MB
