There is a newer version of the record available.

Published July 20, 2025 | Version v1.1
Dataset Open

SNAC-DB: Structural NANOBODY® (VHH) and Antibody (VH-VL) Complex Database

  • 1. Large Molecule Research, Sanofi, Cambridge, MA, United States
  • 2. Department of Chemical and Biomolecular Engineering, Johns Hopkins University, MD, United States
  • 3. Large Molecule Research, Sanofi, Frankfurt, Germany
  • 4. R&D Data & Computational Science, Sanofi, Cambridge, MA, United States

Description

Welcome to SNAC-DB, a comprehensive, curated resource of antibody and NANOBODY® VHH structures, available in ML-ready formats and designed to support computational modeling, machine learning, and structural biology research. This release includes the dataset curated by running the SNAC-DB pipeline (https://github.com/Sanofi-Public/SNAC-DB) on protein structures sourced from the RCSB PDB (https://www.rcsb.org/), as well as a benchmarking dataset for evaluation.

At the moment, we have processed all PDB entries released up to 30 April 2025, with a latest deposition date of 31 March 2025.

Files

README.md

Files (12.6 GB)

MD5 checksum                            Size
md5:fa7b47339537d10829a3395feb5cfb6f    1.2 kB
md5:732ab71380810d316da1acdeadd8a0bb    13.8 kB
md5:e2da351721ba9111c6c8b9e848c19000    12.6 GB
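
Because the largest file is 12.6 GB, it may be worth checking a download against the MD5 checksums listed above before using it. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listed_checksum` are illustrative and not part of the SNAC-DB pipeline, and since the file names are not shown in this listing, the check only tests whether a file's digest matches any of the three listed values.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record ("md5:" prefix dropped).
LISTED_MD5 = {
    "fa7b47339537d10829a3395feb5cfb6f",
    "732ab71380810d316da1acdeadd8a0bb",
    "e2da351721ba9111c6c8b9e848c19000",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """True if the file's MD5 equals one of the digests listed on this record."""
    return md5_of_file(path) in LISTED_MD5
```

To use it, call `matches_listed_checksum` with the local path of a downloaded file (the path is whatever name the file was saved under); a `False` result suggests a truncated or corrupted download, or a file from a different version of the record.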

Additional details

Software

Repository URL
https://github.com/Sanofi-Public/SNAC-DB
Programming language
Python, Jupyter Notebook
Development Status
Active

References

  • H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne (2000). The Protein Data Bank. Nucleic Acids Research 28: 235-242. https://doi.org/10.1093/nar/28.1.235
  • Updated resources for exploring experimentally-determined PDB structures and Computed Structure Models at the RCSB Protein Data Bank (2025). Nucleic Acids Research 53: D564–D574. https://doi.org/10.1093/nar/gkae1091
  • H.M. Berman, K. Henrick, H. Nakamura (2003). Announcing the worldwide Protein Data Bank. Nature Structural Biology 10: 980. https://doi.org/10.1038/nsb1203-980
  • van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L.M., Söding, J., and Steinegger, M. (2023). Fast and accurate protein structure search with Foldseek. Nature Biotechnology. https://doi.org/10.1038/s41587-023-01773-0
  • Dunbar, J., & Deane, C. (2015). ANARCI: antigen receptor numbering and receptor classification. Bioinformatics 32(2): 298–300.
  • Steinegger, M. and Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35(11): 1026-1028.