Annotated sequences extracted from bacterial genomes
Description
Three files containing sequences extracted from 1,049,210 bacterial genomes available from GenBank (release 252). Protein-coding sequences were annotated with IDTAXA (PMID: 34541527) using taxon-specific KEGG groups (Bacteria_Protein_subset.fas.gz). These annotations were transferred to the corresponding nucleotide coding sequences (Bacteria_Nucleotide_subset.fas.gz). Intergenic regions were extracted from each genome and annotated by FindNonCoding (PMID: 34636849) for their overlap with any of 25 common bacterial non-coding RNAs in Rfam (v14). Intergenic regions were required to be at least 100 nucleotides long and to contain no ambiguities (Bacteria_Intergenic_subset.fas.gz). Each subset contains only distinct sequences, in random order.
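The intergenic criteria described above can be illustrated with a minimal Python sketch. This is not the pipeline used to build the dataset; it assumes that "no ambiguities" means only the letters A, C, G and T occur, and the names `passes_intergenic_filter`, `MIN_LENGTH` and `UNAMBIGUOUS` are hypothetical.

```python
# Illustrative sketch only: the dataset was not necessarily built with this code.
# Assumption: "no ambiguities" means every base is one of A, C, G or T.
UNAMBIGUOUS = frozenset("ACGT")
MIN_LENGTH = 100  # minimum intergenic length stated in the description


def passes_intergenic_filter(sequence: str) -> bool:
    """Return True if the sequence is long enough and has no ambiguity codes."""
    sequence = sequence.upper()
    return len(sequence) >= MIN_LENGTH and set(sequence) <= UNAMBIGUOUS


if __name__ == "__main__":
    print(passes_intergenic_filter("ACGT" * 25))          # 100 nt, unambiguous
    print(passes_intergenic_filter("ACGT" * 24 + "ACGN"))  # contains N
```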
Headers
Sequence headers contain the assembly accession followed by the annotation, separated by a "|" character. For example:
Bacteria_Intergenic_subset.fas.gz
>GCA_022121725.1|RF00000
ATGTTACCTTCTTGAGTGATACGGGATGAA[...]
Bacteria_Protein_subset.fas.gz
>GCA_014764685.1|K02049
MPRDLIRISGLEKTYADGSVHALSNIDLSIKD[...]
Bacteria_Nucleotide_subset.fas.gz
>GCA_015948525.1|K02197
GTGAACCTGCGACGTAAAAACCGGCTAYG[...]
Annotations
Protein sequences and their nucleotide coding sequences are labeled with their KEGG group, starting with "K". Intergenic sequences are named by any overlapping Rfam families, each starting with "RF" and separated by commas when multiple families are predicted. "RF00000" is a placeholder indicating that no Rfam family was predicted.
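As a rough illustration of the header and annotation conventions above, the following Python sketch reads a gzipped FASTA file and splits each header into an accession and a list of labels. The function names `parse_header` and `read_fasta` are hypothetical, and treating "RF00000" as an empty list is a choice made here for convenience, not part of the dataset.

```python
import gzip
from typing import Iterator, List, Tuple


def parse_header(header: str) -> Tuple[str, List[str]]:
    """Split '>ACCESSION|ANNOTATION' into the accession and its labels.

    Comma-separated Rfam families become separate list items; the
    placeholder 'RF00000' (no predicted family) yields an empty list.
    """
    accession, _, annotation = header.strip().lstrip(">").partition("|")
    labels = [label for label in annotation.split(",") if label]
    return accession, [label for label in labels if label != "RF00000"]


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

For example, `parse_header(">GCA_014764685.1|K02049")` gives the accession `GCA_014764685.1` with the single label `K02049`, while a header ending in `|RF00000` gives an empty label list. Streaming with `read_fasta("Bacteria_Intergenic_subset.fas.gz")` avoids loading a multi-gigabyte file into memory.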
Files
(38.7 GB)
| Checksum | Size |
|---|---|
| md5:3143ad47f53edb43aba11bca1b6c9473 | 6.1 GB |
| md5:1a0b13a1c7c209e6d904cd162a2172a5 | 23.1 GB |
| md5:22cb79c63c57e2ec61d51e20aeeb53de | 9.5 GB |
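Because the files are large, it may be worth checking a download against the md5 value listed for it. The sketch below uses Python's standard `hashlib` module and reads the file in chunks; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and the local path is whatever name the file was saved under.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file's digest with a listed value such as 'md5:3143ad...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_md5(local_path, "md5:...")`, with the checksum copied from the file listing, returns `True` when the download is intact.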