Published October 1, 2025 | Version v2
Dataset Open

Quadrupling the protein family space with global metagenomics

  • 1. Ancilia Biosciences
  • 2. Alexander Fleming Biomedical Sciences Research Center
  • 3. BioInnovation Greece
  • 4. Hellenic Army Academy
  • 5. University of Athens Medical School

Description

This record provides the data from the study "Quadrupling the protein family space with global metagenomics".

For Protein Families:

iso_clusters25_names.tsv.bz2

Description: TSV file of isolate clusters with more than 25 members.

Columns: (1) Cluster name (2) Representative isolate protein header (3) Isolate protein member header (4) Isolate protein member sequence

metag_clusters25_names.tsv.bz2

Description: TSV file of metagenomic clusters with more than 25 members.

Columns: (1) Cluster name (2) Representative metagenomic protein header (3) Metagenomic protein member header (4) Metagenomic protein member sequence
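Both cluster files share the same four-column layout, so one reader can serve either of them. The following is a minimal sketch, assuming the files are tab-separated with no header row and that the columns appear in the order listed above; the function names and the dictionary keys are illustrative and not part of the dataset.

```python
import bz2
from collections import Counter

# Assumed column order, taken from the column descriptions above.
# Assumption: tab-separated values, no header row.
COLUMNS = ("cluster", "representative", "member", "sequence")


def iter_members(path):
    """Yield one dict per row of a *_clusters25_names.tsv.bz2 file."""
    # bz2.open in text mode decompresses on the fly, so the file is
    # streamed instead of being expanded on disk or loaded into memory.
    with bz2.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != len(COLUMNS):
                continue  # skip blank or malformed lines
            yield dict(zip(COLUMNS, fields))


def cluster_sizes(path):
    """Count member rows per cluster name."""
    return Counter(record["cluster"] for record in iter_members(path))
```

Streaming row by row keeps memory use low, which matters because the compressed files are several gigabytes each.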

For Protein Folds:

structures.tar.gz

Description: Contains three subfolders with protein structure models:

  1. HQ: High-Quality (pTM ≥ 0.7) models (26,767 PDB files)
  2. MQ: Medium-Quality (0.5 ≤ pTM < 0.7) models (47,823 PDB files)
  3. LQ: Low-Quality (pTM < 0.5) models (82,018 PDB files)
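The three quality tiers are defined purely by pTM thresholds, which can be written down as a small function. This is an illustrative sketch: the name `quality_tier` is hypothetical, and only the thresholds come from the list above.

```python
def quality_tier(ptm: float) -> str:
    """Map a pTM score to the HQ/MQ/LQ tier used for the subfolders."""
    if not 0.0 <= ptm <= 1.0:
        raise ValueError(f"pTM must lie in [0, 1], got {ptm}")
    if ptm >= 0.7:
        return "HQ"  # High-Quality: pTM >= 0.7
    if ptm >= 0.5:
        return "MQ"  # Medium-Quality: 0.5 <= pTM < 0.7
    return "LQ"      # Low-Quality: pTM < 0.5
```

Note that both boundaries are inclusive on the upper tier: a model with pTM exactly 0.7 is HQ, and one with pTM exactly 0.5 is MQ.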

foldseek_results.tar.gz

Description: Contains three folders (HQ, MQ and LQ), each with three files:

  1. AF2.tblout (unfiltered hits to AlphaFoldDB)
  2. CATH.tblout (unfiltered hits to CATH)
  3. PDB.tblout (unfiltered hits to PDB)
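Because the hits are unfiltered, most analyses will want to apply a significance cutoff first. The record does not state the column layout of the .tblout files, so the sketch below rests on an assumption: that they use Foldseek's default 12-column tabular output (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits). The `Hit` type, the `iter_hits` function and the default cutoff are illustrative choices, not values taken from the study.

```python
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    evalue: float
    bits: float


def iter_hits(path: str, max_evalue: float = 1e-3) -> Iterator[Hit]:
    """Yield hits whose E-value is at most max_evalue.

    Assumption: tab-separated rows in Foldseek's default column order,
    with the E-value in column 11 and the bit score in column 12.
    """
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#") or len(fields) < 12:
                continue  # comment, blank, or unexpected row
            try:
                evalue = float(fields[10])
                bits = float(fields[11])
            except ValueError:
                continue  # e.g. a header row with column names
            if evalue <= max_evalue:
                yield Hit(fields[0], fields[1], evalue, bits)
```

If the files turn out to use a different column order, only the two indices in the function need to change.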

NMPFAMSDB2_MODELS_SCORES.txt

Description: A text file with the pTM and pLDDT scores for each family model.

Columns: (1) Family name (2) pTM score (3) pLDDT score
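The scores file can be loaded into a simple lookup table. This is a minimal sketch, assuming the three columns are separated by whitespace (tabs or spaces) and that family names contain no whitespace; neither detail is stated in the record. Rows whose score fields are not numeric, such as a possible header line, are skipped. The function name `load_scores` is illustrative.

```python
def load_scores(path):
    """Return {family_name: (ptm, plddt)} from the scores file."""
    scores = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 3:
                continue  # blank or unexpected row
            name, ptm, plddt = parts
            try:
                scores[name] = (float(ptm), float(plddt))
            except ValueError:
                continue  # non-numeric fields, e.g. a header row
    return scores
```

For example, `[name for name, (ptm, _) in load_scores(path).items() if ptm >= 0.7]` would list the families whose models meet the High-Quality threshold.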

Files

Files (26.2 GB)

Checksum                                Size
md5:254b7bf55518eada4b657278ebbff57d    2.9 GB
md5:d6ae4851bdb76667b233bb4a3bb2dc19    12.3 GB
md5:677b976daebebacc289667d9d1fd48b6    8.6 GB
md5:cde30f38f161004e57672b28fbc0d315    2.9 MB
md5:93a8288cd6fde9b290ccbcfff09380e5    2.4 GB