Annotated sequences extracted from bacterial genomes

Wright, Erik Scott

doi:10.5281/zenodo.7970590

Published May 25, 2023 | Version v1

Dataset Open

Affiliation: University of Pittsburgh

This dataset comprises three files of sequences extracted from 1,049,210 bacterial genomes available from GenBank (release 252). Protein-coding sequences were annotated with IDTAXA (PMID: 34541527) using taxon-specific KEGG groups (Bacteria_Protein_subset.fas.gz). These annotations were transferred to the corresponding nucleotide coding sequences (Bacteria_Nucleotide_subset.fas.gz). Intergenic regions were extracted from each genome and annotated with FindNonCoding (PMID: 34636849) for overlap with any of 25 common bacterial non-coding RNAs in Rfam (v14). Intergenic regions were required to be at least 100 nucleotides long and to contain no ambiguities (Bacteria_Intergenic_subset.fas.gz). Each subset contains only distinct sequences, in random order.
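The inclusion criteria stated for intergenic regions (at least 100 nucleotides, no ambiguities) can be written as a simple predicate. This is a minimal sketch and not the pipeline used to build the dataset: the function name `passes_intergenic_filter` is hypothetical, and it assumes that "no ambiguities" means the sequence contains only the characters A, C, G, and T.

```python
# Assumption: "no ambiguities" is interpreted as "only A, C, G, T".
UNAMBIGUOUS = frozenset("ACGT")
MIN_LENGTH = 100  # minimum intergenic length stated in the description


def passes_intergenic_filter(seq: str) -> bool:
    """Return True if seq meets the stated length and ambiguity criteria."""
    seq = seq.upper()
    return len(seq) >= MIN_LENGTH and set(seq) <= UNAMBIGUOUS
```

A check like this may be useful as a sanity test on records read from Bacteria_Intergenic_subset.fas.gz. It does not apply to the nucleotide coding sequences, for which the description states no such restriction.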

Headers

Each sequence header contains the assembly accession followed by the annotation, separated by a "|" character. For example:

Bacteria_Intergenic_subset.fas.gz

>GCA_022121725.1|RF00000
ATGTTACCTTCTTGAGTGATACGGGATGAA[...]

Bacteria_Protein_subset.fas.gz

>GCA_014764685.1|K02049
MPRDLIRISGLEKTYADGSVHALSNIDLSIKD[...]

Bacteria_Nucleotide_subset.fas.gz

>GCA_015948525.1|K02197
GTGAACCTGCGACGTAAAAACCGGCTAYG[...]
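The header layout shown above can be read with the Python standard library alone. The following is a minimal sketch rather than an official reader for this dataset: the function names `split_header` and `read_fasta` are hypothetical, and the code assumes every header contains at least one "|" and that a sequence may span several lines.

```python
import gzip
from typing import Iterator, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split '>ACCESSION|ANNOTATION' into (accession, annotation)."""
    # Split on the first "|" only, so the annotation is kept whole.
    accession, annotation = header.lstrip(">").strip().split("|", 1)
    return accession, annotation


def read_fasta(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (accession, annotation, sequence) from a gzipped FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield (*split_header(header), "".join(chunks))
                header, chunks = line, []
            elif line:
                chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield (*split_header(header), "".join(chunks))
```

Because the files are several gigabytes each, the reader is written as a generator so that records are streamed one at a time instead of being loaded into memory together.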

Annotations

Protein sequences and protein-coding (nucleotide) sequences are labeled with their KEGG group, which starts with "K". Intergenic sequences are named by any overlapping Rfam families, each starting with "RF" and separated by commas when multiple families are predicted. "RF00000" is a placeholder indicating that no Rfam family was predicted.
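The labeling rules above can be turned into a small parser for the annotation part of a header. This is a sketch under stated assumptions: the names `parse_rfam` and `is_kegg` are hypothetical, and the family identifiers in the test values are illustrative only, not taken from the dataset.

```python
from typing import List

NO_FAMILY = "RF00000"  # placeholder meaning "no Rfam family predicted"


def parse_rfam(annotation: str) -> List[str]:
    """Return the predicted Rfam families, or an empty list if none."""
    families = [name for name in annotation.split(",") if name]
    return [name for name in families if name != NO_FAMILY]


def is_kegg(annotation: str) -> bool:
    """Return True if the annotation looks like a KEGG group label."""
    return annotation.startswith("K")
```

Returning an empty list for the placeholder lets downstream code treat "no prediction" and "one or more predictions" uniformly.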

Files

Total size: 38.7 GB

Name	MD5	Size
Bacteria_Intergenic_subset.fas.gz	3143ad47f53edb43aba11bca1b6c9473	6.1 GB
Bacteria_Nucleotide_subset.fas.gz	1a0b13a1c7c209e6d904cd162a2172a5	23.1 GB
Bacteria_Protein_subset.fas.gz	22cb79c63c57e2ec61d51e20aeeb53de	9.5 GB
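
Given the size of these files, it may be worth checking each download against the MD5 checksums listed above. The following is a minimal sketch using only the standard library; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local path is whatever location the file was saved to.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "Bacteria_Intergenic_subset.fas.gz": "3143ad47f53edb43aba11bca1b6c9473",
    "Bacteria_Nucleotide_subset.fas.gz": "1a0b13a1c7c209e6d904cd162a2172a5",
    "Bacteria_Protein_subset.fas.gz": "22cb79c63c57e2ec61d51e20aeeb53de",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, name: str) -> bool:
    """Compare a local file's MD5 with the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Reading in fixed-size chunks keeps memory use constant, which matters for a 23.1 GB file.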

