Published May 6, 2022 | Version 2.0
Dataset Open

extHomFam 2: large-scale benchmark for protein multiple sequence alignments

  • 1. Silesian University of Technology

Description

extHomFam 2 was constructed by combining Homstrad reference alignments (March 2020) with Pfam 33.1 complete families (NCBI variant). Homstrad entries with fewer than 3 reference sequences, as well as those pointing to dead Pfam families, were discarded. The resulting benchmark was divided into subsets according to the family size N:

subset    N range                  # families
small     [200, 10 000)            86
medium    [10 000, 40 000)         95
large     [40 000, 100 000)        83
xlarge    [100 000, 250 000)       67
huge      [250 000, 3 000 000)     62
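
The size ranges above are half-open intervals: the lower bound is included and the upper bound is excluded. As a minimal sketch of that rule, the following Python snippet maps a family size to its subset name. The names SUBSETS and subset_for_size are illustrative and are not part of the dataset or of any accompanying tool.

```python
# Subset boundaries copied from the table above:
# (subset name, lower bound inclusive, upper bound exclusive).
SUBSETS = [
    ("small", 200, 10_000),
    ("medium", 10_000, 40_000),
    ("large", 40_000, 100_000),
    ("xlarge", 100_000, 250_000),
    ("huge", 250_000, 3_000_000),
]


def subset_for_size(n):
    """Return the subset name for family size n, or None if n fits no range."""
    for name, low, high in SUBSETS:
        if low <= n < high:
            return name
    return None
```

For example, a family of exactly 10 000 sequences falls into 'medium' rather than 'small', and sizes below 200 match no subset.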

The directories in the archive correspond to the names of the subsets, while the reference alignments are located in the 'ref' folder.

 

Files

extHomFam-v2.zip (3.9 GB)
md5:40caec9385955447b110e0c1ccb3fa9d
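
The MD5 checksum listed above can be used to confirm that the archive was downloaded without corruption. The snippet below is a sketch that assumes the file was saved locally under its original name; the helper names md5_of_file and archive_is_intact are illustrative. The file is read in chunks so that the 3.9 GB archive is never loaded into memory at once.

```python
import hashlib

# Checksum published on this page for extHomFam-v2.zip.
EXPECTED_MD5 = "40caec9385955447b110e0c1ccb3fa9d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling archive_is_intact("extHomFam-v2.zip") should return True for an undamaged download, where the path is an assumption about where the file was saved.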

Additional details

Related works

Continues
Journal article: 10.1038/srep33964 (DOI)
Is new version of
Dataset: 10.7910/DVN/BO2SVW (DOI)

References

  • Deorowicz, S., Debudaj-Grabysz, A. & Gudyś, A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Sci Rep 6, 33964 (2016). https://doi.org/10.1038/srep33964