Published June 6, 2024 | Version v2
Dataset Open

Benchmark dataset for CATH hierarchical clustering tools (GeMMA/FunFHMMer, MARC, FRAN and eMMA)

  • 1. University College London

Description

Benchmark dataset for CATH superfamily 3.40.50.620 (HUPs).

Contains Functional Family (FunFam) alignments and Hidden Markov Models generated by GeMMA/FunFHMMer, MARC, FRAN and CATH-eMMA, together with the Python code used to assess their quality (EC purity, DOPS, Neff) and the intermediate steps of the MARC and FRAN pipelines (pooling, randomisation, renaming).
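As an illustration of the kind of quality measure listed above, the following is a minimal sketch of an EC purity calculation. This record does not define EC purity, so the sketch assumes one plausible reading: the fraction of EC-annotated sequences in a family that share the family's most common EC number. The function name `ec_purity` and the input layout are hypothetical and are not taken from the dataset's own code.

```python
from collections import Counter
from typing import Iterable, Optional


def ec_purity(ec_numbers: Iterable[Optional[str]]) -> Optional[float]:
    """Fraction of annotated sequences carrying the most common EC number.

    Assumption (not from the dataset): sequences without an EC annotation
    (None or empty string) are ignored, and a family with no annotated
    sequences has undefined purity (returned as None).
    """
    annotated = [ec for ec in ec_numbers if ec]
    if not annotated:
        return None
    # Count of the single most frequent EC number in the family.
    top_count = Counter(annotated).most_common(1)[0][1]
    return top_count / len(annotated)


if __name__ == "__main__":
    # Toy family: three annotated sequences, one unannotated.
    print(ec_purity(["2.7.7.1", "2.7.7.1", "6.1.1.1", None]))
```

Under this assumed definition, a family whose annotated members all share one EC number scores 1.0. The scripts shipped with the dataset may use a different definition.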

3.4.50.620_full_superfamily_sequences.fasta contains all HUPs superfamily sequences; the FunFams are a subset of these.

all_starting_clusters_sequences.fasta contains the sequences included in the starting clusters used in the analyses.
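A minimal sketch of how one might read these FASTA files and check whether the sequence identifiers in one file all occur in another, using only the Python standard library. It assumes that the identifier is the first whitespace-delimited token of each header line, which may not match the header layout actually used in these files. The helper names `parse_fasta`, `read_fasta` and `missing_ids` are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {sequence_id: sequence} mapping."""
    records: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited header token.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def read_fasta(path: str) -> Dict[str, str]:
    """Read a FASTA file from disk."""
    with open(path) as handle:
        return parse_fasta(handle)


def missing_ids(subset: Dict[str, str], superset: Dict[str, str]) -> Set[str]:
    """Return the IDs present in `subset` but absent from `superset`."""
    return set(subset) - set(superset)


if __name__ == "__main__":
    full = parse_fasta([">seqA example", "MKV", "LLA", ">seqB", "GGH"])
    part = parse_fasta([">seqA", "MKVLLA"])
    print(missing_ids(part, full))  # empty set when part is contained in full
```

With the downloaded files, `read_fasta` could be called on each path and the two mappings passed to `missing_ids`. An empty result means every identifier in the first file is also present in the second.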

3.40.50.620_embedded.pt includes embeddings for the HUPs superfamily generated using the ESM2 Protein Language Model.

 

Files (197.5 MB)

Checksum                                Size
md5:433547804445216fa7045c51f35657ab    19.8 MB
md5:eb15bea9a23f7f8154b18e93d20ffb46    146.5 MB
md5:51fcc0ec83e5ffa4ff11fcc25ac137c0    9.8 MB
md5:eb9a4d07428eed7ded5b0860ec207a22    21.4 MB
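The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file path in the usage example is a placeholder, since the listing does not show which file name belongs to which checksum.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 with an expected value such as 'md5:<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    import sys

    # Usage: python verify_md5.py <downloaded_file> <md5:checksum>
    if len(sys.argv) == 3:
        ok = matches_checksum(sys.argv[1], sys.argv[2])
        print("checksum OK" if ok else "checksum MISMATCH")
```

A mismatch usually indicates a truncated or corrupted download, in which case the file should be fetched again before use.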