Published September 22, 2025 | Version v8
Dataset Open

microbetag: building a thorough database of genome-scale KO annotations

  • 1. IMBBC - HCMR

Description

In this repository we keep internal data for the microbetag microbial co-occurrence network annotator.

microbetag uses a 2-column file for each genome, listing each KO term found and a KEGG module in which that term takes part.
As a single KO term may participate in more than one KEGG module, the same KO may appear more than once in an annotation file.
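
A minimal sketch of how such a 2-column annotation file could be read, grouping KO terms by module. The tab delimiter, the column order (KO first, module second) and the function name `read_ko_module_pairs` are assumptions made for illustration. The rows in the example are illustrative and not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def read_ko_module_pairs(handle):
    """Group the KO terms of one genome by KEGG module.

    Assumes tab-separated lines of the form KO<TAB>module.
    Delimiter and column order are assumptions, not documented facts.
    """
    kos_per_module = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        ko, module = row[0].strip(), row[1].strip()
        kos_per_module[module].add(ko)
    return dict(kos_per_module)


# Illustrative rows: the first KO is listed twice, once per module.
example = io.StringIO("K00844\tM00001\nK00844\tM00549\nK01810\tM00001\n")
modules = read_ko_module_pairs(example)
print(modules)
```

Using a set per module means a KO that is repeated across rows is kept once per module it belongs to.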

chem_xref.tar.gz

The MNXref namespace

  1. The identifier of a chemical compound in an external resource [XREF]
  2. The corresponding identifier in the MNXref namespace [MNX_ID]
  3. The description given by the external resource [STRING]

MNXref 4.0 release notes:

  - The third column (evidence tag for the mapping) was suppressed
  - The descriptions were completed
  - Deprecated identifiers were moved into their own table below
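
The three columns listed above could be read into a lookup table along these lines. This is a sketch that assumes the file is tab-separated and that comment lines start with `#`. The function name `read_chem_xref` and the identifiers in the example are made up for illustration.

```python
import csv
import io


def read_chem_xref(handle):
    """Map external compound identifiers (XREF) to MNXref identifiers (MNX_ID).

    Assumes tab-separated columns XREF, MNX_ID, description,
    with comment lines starting with '#'.
    """
    data_lines = (
        line for line in handle if line.strip() and not line.startswith("#")
    )
    xref_to_mnx = {}
    # QUOTE_NONE keeps quote characters inside descriptions untouched.
    for row in csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip malformed lines
        xref_to_mnx[row[0]] = row[1]
    return xref_to_mnx


# Made-up identifiers, used only to show the expected layout.
example = io.StringIO(
    "#XREF\tMNX_ID\tDescription\n"
    "extdb:A001\tMNX_EXAMPLE_1\tcompound A\n"
    "extdb:A002\tMNX_EXAMPLE_1\tcompound A, synonym\n"
)
mapping = read_chem_xref(example)
print(mapping)
```

Several external identifiers may point to the same MNXref identifier, so the dictionary is keyed by the external one.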

gtdb_modelseed_gems.zip

For all the GTDB genomes, the corresponding PATRIC annotations were gathered. Then, using modelseedpy, we constructed their genome-scale metabolic reconstructions (GEMs).

gtdb_kofam_scan_per_module.tar.gz

All representative genomes of GTDB (v.202) were parsed and their corresponding `.faa` files were retrieved from the NCBI FTP. The kofam_scan tool was then used to annotate them and, finally, a custom script was used to keep the KOs of each genome per module.

SeedSet.pkl.gz

A pickle file with the seeds of each GEM included in gtdb_modelseed_gems.zip, related to the KEGG modules through the seedId_keggId_module.tsv file available on microbetag's GitHub page. Example:

PATRIC     SeedSet
373.172    [cpd00891, cpd00136, cpd00199, cpd01772, cpd00...
397278.5   [cpd00891, cpd00136, cpd01772, cpd02698, cpd08...
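
A sketch of how a `.pkl.gz` file like this one might be opened, assuming it is a gzip-compressed Python pickle. The helper name `load_gzipped_pickle` is hypothetical, and the demo writes and reads back a tiny stand-in object rather than the real file. If the stored object is a pandas object, as the tabular example suggests, pandas would also need to be installed for unpickling to succeed.

```python
import gzip
import os
import pickle
import tempfile


def load_gzipped_pickle(path):
    """Load a gzip-compressed pickle.

    Unpickling can execute arbitrary code, so only open files
    that come from a source you trust.
    """
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in data shaped like the example above (genome id -> compound ids).
demo = {"373.172": ["cpd00891", "cpd00136"]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl.gz")
    with gzip.open(demo_path, "wb") as handle:
        pickle.dump(demo, handle)
    restored = load_gzipped_pickle(demo_path)
print(restored == demo)
```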

NonSeedSet.pkl.gz

A pickle file with the non-seeds of each GEM included in gtdb_modelseed_gems.zip, related to the KEGG modules through the seedId_keggId_module.tsv file available on microbetag's GitHub page. Example:

PATRIC      NonSeedSet
64187.548   [cpd00508, cpd00869, cpd00774, cpd03830, cpd00...
74426.1719  [cpd00204, cpd00447, cpd20171, cpd03470, cpd00...

seeds_per_genome.pkl.gz

A pickle file with a binary representation of the seeds per genome. Example:

                         cpd00493  cpd00296  cpd11431  cpd00063  cpd15717   ... 
2162051.4          1         0         0         1         0         0         0         0         0         0  ...     
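
To illustrate what such a binary representation encodes, the sketch below turns per-genome seed lists into 0/1 rows over a shared list of compounds. It is not the script used to build the file. The function name, the genome labels and the sorted column order are choices made for this example only.

```python
def to_presence_matrix(seeds_per_genome):
    """Return (compounds, rows): rows[genome][i] is 1 if compounds[i] is a seed."""
    # Shared column order: every compound seen in any genome, sorted by id.
    compounds = sorted({c for seeds in seeds_per_genome.values() for c in seeds})
    rows = {}
    for genome, seeds in seeds_per_genome.items():
        present = set(seeds)
        rows[genome] = [int(c in present) for c in compounds]
    return compounds, rows


# Hypothetical genomes with short seed lists.
seed_sets = {
    "genomeA": ["cpd00002", "cpd00001"],
    "genomeB": ["cpd00001", "cpd00003"],
}
compounds, rows = to_presence_matrix(seed_sets)
print(compounds)
print(rows)
```

Each row has one entry per compound, so genomes can be compared column by column.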

nonseeds_per_genome.pkl.gz

As above, but for non-seeds.

phen_classes.zip

A set of pickle files with the re-trained classes of phenDB, used to predict functional traits of a genome.

Files

Files (6.0 GB)

Checksum                              Size
md5:3ca5e33e743d2cd52ad32ab0d897129c  44.9 MB
md5:cbcc9aa1a28a5bd5f6661f832d27bcbf  307.3 MB
md5:e3e62b305e64b27da7b80655d7f92f2c  5.6 GB
md5:00a91dae6e02347624bc5345191098ef  16.1 MB
md5:5c153b61e693eccf2eb81aa9f4b0272f  34.3 MB
md5:9e3f7a84fe7409ef0282ca5424797976  2.3 MB
md5:77c2935f16517b68862938d78d1b1b3a  4.7 MB
md5:51ccafc2d2a2079718493a9487667f87  4.0 MB
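
A downloaded file can be checked against its listed md5 checksum with a few lines of standard-library Python. The helper names below are hypothetical, and the demo uses a small temporary file instead of one of the archives.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file with a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demo on a temporary file holding the bytes b"hello".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    demo_ok = matches_listing(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
    demo_bad = matches_listing(demo_path, "md5:00000000000000000000000000000000")
print(demo_ok, demo_bad)
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte archive.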