Published May 25, 2023 | Version v1
Dataset Open

Annotated sequences extracted from bacterial genomes

Authors/Creators

  • 1. University of Pittsburgh

Description

This record provides three files of sequences extracted from 1,049,210 bacterial genomes available from GenBank (release 252). Protein-coding sequences were annotated with IDTAXA (PMID: 34541527) using taxon-specific KEGG groups (Bacteria_Protein_subset.fas.gz). These annotations were transferred to the corresponding nucleotide coding sequences (Bacteria_Nucleotide_subset.fas.gz). Intergenic regions were extracted from each genome and annotated with FindNonCoding (PMID: 34636849) for overlap with any of 25 common bacterial non-coding RNAs in Rfam (v14). Intergenic regions were required to be at least 100 nucleotides long and to contain no ambiguities (Bacteria_Intergenic_subset.fas.gz). Each subset contains only distinct sequences, in random order.
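The intergenic filter described above can be illustrated with a minimal Python sketch. This is not the pipeline used to build the dataset. The function names are hypothetical, and the code assumes that "no ambiguities" means the sequence contains only the letters A, C, G and T.

```python
def keep_intergenic(seq: str, min_length: int = 100) -> bool:
    """Return True if seq passes the length and ambiguity filters.

    Assumption: "no ambiguities" is taken to mean only A, C, G, T.
    """
    seq = seq.upper()
    return len(seq) >= min_length and set(seq) <= set("ACGT")


def distinct(seqs):
    """Yield each sequence once, in first-seen order."""
    seen = set()
    for s in seqs:
        if s not in seen:
            seen.add(s)
            yield s
```

Under these assumptions, a region of 100 unambiguous nucleotides is kept, while a shorter region or one containing a code such as "N" is dropped.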

Headers

Each sequence header contains the assembly accession followed by the annotation, separated by a "|" character. For example:

Bacteria_Intergenic_subset.fas.gz

>GCA_022121725.1|RF00000
ATGTTACCTTCTTGAGTGATACGGGATGAA[...]

Bacteria_Protein_subset.fas.gz

>GCA_014764685.1|K02049
MPRDLIRISGLEKTYADGSVHALSNIDLSIKD[...]

Bacteria_Nucleotide_subset.fas.gz

>GCA_015948525.1|K02197
GTGAACCTGCGACGTAAAAACCGGCTAYG[...]
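As a rough illustration of how these headers might be read, the sketch below streams a gzipped FASTA file and splits each header at the "|" character. The function names split_header and read_fasta are hypothetical and not part of the dataset. The code assumes one header line per record, with the sequence possibly wrapped over several lines.

```python
import gzip
from typing import Iterator, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split '>ACCESSION|ANNOTATION' into (accession, annotation)."""
    accession, sep, annotation = header.lstrip(">").strip().partition("|")
    if not sep:
        raise ValueError(f"no '|' separator in header: {header!r}")
    return accession, annotation


def read_fasta(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (accession, annotation, sequence) for each record in a .fas.gz file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield (*split_header(header), "".join(chunks))
                header, chunks = line, []
            else:
                chunks.append(line)
    # Emit the final record after the file is exhausted.
    if header is not None:
        yield (*split_header(header), "".join(chunks))
```

Streaming record by record avoids loading a multi-gigabyte file into memory, which matters given the file sizes listed on this page.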

Annotations

Protein and protein-coding nucleotide sequences are labeled with their KEGG group, which starts with "K". Intergenic sequences are labeled with any overlapping Rfam families, each starting with "RF" and separated by commas when multiple families are predicted. "RF00000" is a placeholder indicating that no Rfam family was predicted.
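The labeling rules above could be interpreted in code roughly as follows. This is a sketch only. The function name parse_annotation and the "kegg"/"rfam" tags are hypothetical, and the code assumes that an annotation is either a single KEGG group or a comma-separated list of Rfam families.

```python
from typing import List, Tuple


def parse_annotation(annotation: str) -> Tuple[str, List[str]]:
    """Classify an annotation string as KEGG or Rfam and list its labels.

    The placeholder "RF00000" is mapped to an empty list of families.
    """
    annotation = annotation.strip()
    # Test "RF" first only for clarity; "RF" and "K" prefixes do not collide.
    if annotation.startswith("RF"):
        families = [f for f in annotation.split(",") if f != "RF00000"]
        return "rfam", families
    if annotation.startswith("K"):
        return "kegg", [annotation]
    raise ValueError(f"unrecognized annotation: {annotation!r}")
```

With this sketch, "K02049" yields one KEGG label, "RF00000" yields an empty list of families, and a comma-separated annotation yields one entry per family.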

Files

Files (38.7 GB)

Checksum Size
md5:3143ad47f53edb43aba11bca1b6c9473 6.1 GB
md5:1a0b13a1c7c209e6d904cd162a2172a5 23.1 GB
md5:22cb79c63c57e2ec61d51e20aeeb53de 9.5 GB
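After downloading, a file can be checked against its listed MD5 checksum. The sketch below uses Python's standard hashlib module. The function names are hypothetical, and because the listing above does not show which checksum belongs to which file name, the expected value must be supplied by the user.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 with an expected value such as 'md5:3143...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.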