Quadrupling the protein family space with global metagenomics
Description
Here you can find the data from the study "Quadrupling the protein family space with global metagenomics".
For Protein Families:
iso_clusters25_names.tsv.bz2
Description: TSV file of isolate clusters with more than 25 members.
Columns: (1) Cluster name (2) Representative isolate protein header (3) Isolate protein member header (4) Isolate protein member sequence
metag_clusters25_names.tsv.bz2
Description: TSV file of metagenomic clusters with more than 25 members.
Columns: (1) Cluster name (2) Representative metagenomic protein header (3) Metagenomic protein member header (4) Metagenomic protein member sequence
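Both cluster files share the same four-column layout, so one reader can serve for either. The following is a minimal sketch, not a tool shipped with the dataset: it assumes the files are tab-delimited with no header row (the description does not say either way), and the names `ClusterMember` and `iter_cluster_members` are hypothetical.

```python
import bz2
import csv
from typing import Iterator, NamedTuple


class ClusterMember(NamedTuple):
    """One row of a cluster TSV, in the documented column order."""
    cluster: str          # (1) cluster name
    representative: str   # (2) representative protein header
    member: str           # (3) member protein header
    sequence: str         # (4) member protein sequence


def iter_cluster_members(path: str) -> Iterator[ClusterMember]:
    """Stream rows from a bz2-compressed TSV without decompressing it to disk.

    Assumption: tab-delimited, four fields per row, no header row.
    If the file does have a header, it is yielded as the first record.
    """
    with bz2.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 4:
                continue  # skip blank or malformed lines
            yield ClusterMember(*row)
```

Iterating instead of loading the whole file keeps memory use flat, for example `for rec in iter_cluster_members("metag_clusters25_names.tsv.bz2"): ...`.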
For Protein Folds:
structures.tar.gz
Description: Contains three subfolders with protein structure models:
- HQ: High-Quality (pTM ≥ 0.7) models (26,767 pdb files)
- MQ: Medium-Quality (0.5 ≤ pTM < 0.7) models (47,823 pdb files)
- LQ: Low-Quality (pTM < 0.5) models (82,018 pdb files)
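The three quality tiers are defined purely by pTM thresholds, which can be written as a small function. This is an illustrative sketch; the function name `quality_tier` is hypothetical and not part of the dataset.

```python
def quality_tier(ptm: float) -> str:
    """Map a pTM score to the subfolder name used in structures.tar.gz.

    Thresholds follow the description: HQ for pTM >= 0.7,
    MQ for 0.5 <= pTM < 0.7, LQ for pTM < 0.5.
    """
    if not 0.0 <= ptm <= 1.0:
        raise ValueError(f"pTM must lie in [0, 1], got {ptm}")
    if ptm >= 0.7:
        return "HQ"
    if ptm >= 0.5:
        return "MQ"
    return "LQ"
```

Note that both boundary values belong to the higher tier: a model with pTM exactly 0.7 is HQ, and one with pTM exactly 0.5 is MQ.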
foldseek_results.tar.gz
Description: You will find three folders (HQ, MQ & LQ). Each one contains three files:
- AF2.tblout (unfiltered hits to AlphaFoldDB)
- CATH.tblout (unfiltered hits to CATH)
- PDB.tblout (unfiltered hits to PDB)
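To check which result files an archive contains before extracting it, the members can be listed and grouped by their parent folder. The sketch below makes no assumption about the column layout of the .tblout files and only inspects file names; the helper name `files_by_folder` is hypothetical, and the exact directory nesting inside the archive is not stated in the description.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def files_by_folder(archive_path: str) -> dict[str, list[str]]:
    """Group the regular files in a tar archive by their immediate parent folder."""
    grouped = defaultdict(list)
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                path = PurePosixPath(member.name)
                grouped[path.parent.name].append(path.name)
    return {folder: sorted(names) for folder, names in grouped.items()}
```

Calling `files_by_folder("foldseek_results.tar.gz")` should, if the archive matches the description, return one entry per quality folder with the three .tblout files listed under each.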
NMPFAMSDB2_MODELS_SCORES.txt
Description: A txt file with the pTM and pLDDT scores for each family model.
Columns: (1) Family name (2) pTM score (3) pLDDT score
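The score table can be loaded into a dictionary keyed by family name. This is a sketch under stated assumptions: the delimiter is not documented, so the code assumes whitespace-separated fields and family names without spaces, and it skips any line whose scores are not numeric (such as a possible header row). The function name `load_model_scores` is hypothetical.

```python
from typing import Dict, Tuple


def load_model_scores(path: str) -> Dict[str, Tuple[float, float]]:
    """Return {family name: (pTM, pLDDT)} from a three-column text file.

    Assumption: fields are separated by whitespace (tabs or spaces).
    """
    scores: Dict[str, Tuple[float, float]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            name, ptm_text, plddt_text = fields
            try:
                scores[name] = (float(ptm_text), float(plddt_text))
            except ValueError:
                continue  # non-numeric scores, e.g. a header row
    return scores
```

With the table in memory, a lookup such as `load_model_scores("NMPFAMSDB2_MODELS_SCORES.txt")[family_name]` gives the pTM and pLDDT pair for a single family.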