Published February 25, 2025 | Version 1.0
Dataset Open

Training data for the SOLeNNoID protein annotation software

Description

This dataset is an edited and upgraded version of the REPETITA dataset, and was used to train and test the SOLeNNoID convolutional neural network for solenoid residue classification. The original dataset comprised PDB files with full or partial protein structures for solenoid and non-solenoid proteins, as well as residue spans (e.g. residues 224-280) for solenoid regions.

This version of the dataset includes the original PDB files, distance matrices derived from them that account for residues missing from the PDB files, and revised labels. The revised labels were produced by mapping and editing the original residue spans into a per-residue label file that matches each structure and distance matrix. Additionally, extra beta-solenoid entries were manually added to the dataset.
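The two derived data types described above can be illustrated with a minimal sketch. It is not the code used to build this dataset: the record does not state the file formats, which atoms the distances are measured between, or how missing residues are encoded. The sketch assumes binary labels (1 = solenoid), C-alpha coordinates, and NaN for any pair involving a missing residue; the function names are hypothetical.

```python
import math

def spans_to_labels(spans, first_residue, last_residue):
    """Expand inclusive residue spans into one 0/1 label per residue number."""
    labels = [0] * (last_residue - first_residue + 1)
    for start, end in spans:
        # Clip each span to the numbered range before marking it as solenoid.
        lo = max(start, first_residue)
        hi = min(end, last_residue)
        for resnum in range(lo, hi + 1):
            labels[resnum - first_residue] = 1
    return labels

def distance_matrix_with_gaps(ca_coords, first_residue, last_residue):
    """Pairwise distances over the full residue range, gaps included.

    ca_coords maps residue number -> (x, y, z) for observed residues only.
    Assumption: pairs involving an unobserved residue are filled with NaN,
    so the matrix keeps one row and column per residue number.
    """
    residues = range(first_residue, last_residue + 1)
    matrix = []
    for i in residues:
        row = []
        for j in residues:
            if i in ca_coords and j in ca_coords:
                row.append(math.dist(ca_coords[i], ca_coords[j]))
            else:
                row.append(math.nan)
        matrix.append(row)
    return matrix
```

For example, `spans_to_labels([(3, 5)], 1, 6)` returns `[0, 0, 1, 1, 1, 0]`. Keeping a row and column for every residue number, observed or not, is what lets a label list and a distance matrix share the same length.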

Finally, a new test dataset was curated from reviewed solenoid entries in the RepeatsDB database.

The dataset is split into training_validation_dataset and test_dataset directories.

The training_validation_dataset directory comprises the original REPETITA dataset, split into alpha-, alpha/beta-, beta-, and non-solenoid directories. Each of these directories contains subdirectories with distance matrices, labels, and PDB structures. In addition, the beta_additional directory contains subdirectories with the distance matrices, labels and PDB structures of further manually added beta-solenoid entries.

The test_dataset directory comprises directories with non-solenoid and solenoid structures in mmCIF format, as well as a directory with the ground-truth labels and the labels predicted by the TAPO, RepeatsDB-Lite and PRIGSA2 methods.
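After unpacking the archive, the layout described above can be checked with a short sketch like the following. It makes no assumption about the exact directory names, which are not spelled out in this description; it only walks whatever tree is found and counts the files in each directory. The function name and the extraction path in the usage note are hypothetical.

```python
from pathlib import Path

def count_files_per_directory(root):
    """Return {relative directory path: number of files directly inside it}."""
    root = Path(root)
    counts = {}
    for directory in sorted(p for p in root.rglob("*") if p.is_dir()):
        # Count only regular files directly inside this directory.
        n_files = sum(1 for child in directory.iterdir() if child.is_file())
        counts[directory.relative_to(root).as_posix()] = n_files
    return counts
```

Calling `count_files_per_directory("SoleNNoID_dataset")`, where the argument is wherever the archive was extracted, gives a quick view of whether the distance matrix, label, and structure subdirectories of each class hold matching numbers of files.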

References and links:

REPETITA: Luca Marsella, Francesco Sirocco, Antonio Trovato, Flavio Seno, Silvio C.E. Tosatto, REPETITA: detection and discrimination of the periodicity of protein solenoid repeats by discrete Fourier transform, Bioinformatics, Volume 25, Issue 12, June 2009, Pages i289–i295, https://doi.org/10.1093/bioinformatics/btp232

RAPHAEL: Ian Walsh, Francesco G. Sirocco, Giovanni Minervini, Tomás Di Domenico, Carlo Ferrari, Silvio C. E. Tosatto, RAPHAEL: recognition, periodicity and insertion assignment of solenoid protein structures, Bioinformatics, Volume 28, Issue 24, December 2012, Pages 3257–3264, https://doi.org/10.1093/bioinformatics/bts550

SOLeNNoID: Georgi Nikov, Daniella Pretorius, James W. Murray, SOLeNNoID: A Deep Learning Pipeline For Solenoid Residue Detection in Protein Structures, bioRxiv, 2024

REPETITA/RAPHAEL dataset link: http://old.protein.bio.unipd.it/raphael/precompiled.html

Files

SoleNNoID_dataset.zip (300.1 MB)
md5:cefac28224243c3a384631639a1a105d
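
The MD5 checksum listed above can be used to confirm that the archive downloaded intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper name and the local file path in the usage note are assumptions.

```python
import hashlib

# Checksum as listed for SoleNNoID_dataset.zip on this record.
EXPECTED_MD5 = "cefac28224243c3a384631639a1a105d"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the archive was saved as `SoleNNoID_dataset.zip` in the current directory, `md5_of_file("SoleNNoID_dataset.zip") == EXPECTED_MD5` should evaluate to `True` for an uncorrupted download.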

Additional details

Funding

Engineering and Physical Sciences Research Council
DTP EP/R513052/1
Engineering and Physical Sciences Research Council
CDT in BioDesign Engineering EP/S022856/1

Dates

Available
2025-02-26