There is a newer version of the record available.

Published October 18, 2023 | Version v1
Dataset Open

Training and test datasets for the PredictONCO tool

Description

This dataset was used to train and validate the PredictONCO web tool, which supports decision-making in precision oncology by extending bioinformatics predictions with advanced computing and machine learning. The dataset consists of 1073 single-point mutants of 42 proteins, each classified as either Oncogenic (509 data points) or Benign (564 data points). All mutations were annotated with a clinically verified effect and were compiled from the ClinVar and OncoKB databases. The dataset was manually curated using the information available in other precision oncology databases (The Clinical Knowledgebase by The Jackson Laboratory, Personalized Cancer Therapy Knowledge Base by MD Anderson Cancer Center, cBioPortal, DoCM database) or in the primary literature. While creating the dataset, we also removed any overlaps with the data points used in the PredictSNP consensus predictor and its constituent tools. This avoids test-set data leakage, because the PredictSNP score is used as one of the features (see below).
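The overlap removal described above can be pictured as a set difference keyed on the substitution itself. The following is a minimal sketch under stated assumptions, not the authors' curation code: the `Mutation` record, its field names, the `substitution_key` and `remove_overlaps` helpers, and the `PROT_A`/`PROT_B` placeholder entries are all hypothetical and do not come from the dataset.

```python
from typing import Iterable, NamedTuple


class Mutation(NamedTuple):
    """Hypothetical record for one single-point mutant."""
    protein: str    # protein identifier (placeholder values below)
    position: int   # residue position
    wild_type: str  # original amino acid
    mutant: str     # substituted amino acid
    label: str      # "Oncogenic" or "Benign"


def substitution_key(m: Mutation) -> tuple:
    """Identify a mutant by its substitution, ignoring the label."""
    return (m.protein, m.position, m.wild_type, m.mutant)


def remove_overlaps(candidates: Iterable[Mutation],
                    predictor_data: Iterable[Mutation]) -> list:
    """Drop candidates whose substitution already occurs in predictor_data."""
    seen = {substitution_key(m) for m in predictor_data}
    return [m for m in candidates if substitution_key(m) not in seen]


# Placeholder data for illustration only.
candidates = [
    Mutation("PROT_A", 10, "A", "V", "Oncogenic"),
    Mutation("PROT_A", 25, "G", "D", "Benign"),
    Mutation("PROT_B", 7, "L", "P", "Oncogenic"),
]
predictor_data = [Mutation("PROT_A", 10, "A", "V", "Benign")]

kept = remove_overlaps(candidates, predictor_data)
```

Matching on the substitution rather than on the whole record means a mutant is dropped even if the other data source labels it differently, which is the conservative choice when the goal is to prevent leakage.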

The entire dataset (SEQ) was further annotated by the PredictONCO pipeline. Briefly, the following six features were calculated regardless of whether structural information was available: the essentiality of the mutated residue (yes/no), the conservation of the position (the conservation grade and the conservation score), the domain where the mutation is located (cytoplasmic, extracellular, transmembrane, other), the PredictSNP score, and the number of essential residues in the protein. For approximately half of the data (STR: 453 data points, 377 oncogenic and 76 benign), structural information was available, and six more features were calculated: the FoldX and Rosetta ddg_monomer scores, whether the residue is in the catalytic pocket (residues forming the ligand-binding pocket were identified with P2Rank), and the pKa changes (the minimum and maximum changes and the number of essential residues whose pKa changed, all obtained from PROPKA3). For both the STR and SEQ datasets, 20% of the data was held out for testing. The data split was implemented at the position level to ensure that no position from the test subset appears in the training subset.
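A position-level split like the one described above can be sketched by grouping the data points by position and assigning whole groups to one subset. This is an illustrative sketch only, not the procedure used by the authors: the function name `position_level_split`, the dictionary keys `"protein"` and `"position"`, the choice to define a position as a (protein, position) pair, and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict


def position_level_split(records, test_fraction=0.2, seed=0):
    """Split records so that no (protein, position) pair spans both subsets.

    `records` is a list of dicts with the hypothetical keys
    "protein" and "position".
    """
    # Group all data points that share the same mutated position.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["protein"], rec["position"])].append(rec)

    # Sort first so the shuffle is reproducible for a given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(records)
    train, test = [], []
    for key in keys:
        # Fill the test subset first, then send whole groups to training.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test


# Placeholder data: 50 positions with two mutants each.
records = [
    {"protein": "PROT_A", "position": pos, "mutant": aa}
    for pos in range(1, 51)
    for aa in ("V", "D")
]
train, test = position_level_split(records)
```

Because whole groups are assigned at once, the test subset can slightly exceed the requested fraction when group sizes are uneven; the guarantee that matters here is that the two subsets share no position.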

For more details about the tool, please visit the help page or get in touch with us.

Files

PredictONCO-features.txt

Files (135.6 kB)

Size      Checksum
82.8 kB   md5:858565e1800520cc2277d5795413d222
37.0 kB   md5:7b02fe11ab5b159875d20d1380781a1f
15.8 kB   md5:10bb4939a529e9ba65640b871b3db288
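After downloading, the MD5 checksums listed above can be used to confirm that a file arrived intact. The sketch below uses only Python's standard `hashlib` module. The helper names `md5_of` and `matches_listing` are hypothetical, and because this listing does not show which file name belongs to which checksum, the check only tests whether a local file matches any of the three listed values.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
LISTED_MD5 = {
    "858565e1800520cc2277d5795413d222",
    "7b02fe11ab5b159875d20d1380781a1f",
    "10bb4939a529e9ba65640b871b3db288",
}


def md5_of(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """True if the file's MD5 equals one of the listed checksums."""
    return md5_of(path) in LISTED_MD5
```

For example, `matches_listing("PredictONCO-features.txt")` would be expected to return `True` for an intact download of that file, assuming it is one of the three listed files.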

Additional details

References

  • Stourac J, Borko S, Khan RT, Pokorna P, Dobias A, Planas-Iglesias J, Mazurenko S, Pinto G, Szotkowska V, Sterba J, Slaby O, Damborsky J*, Bednar D*. PredictONCO: A Web Tool Supporting Decision Making in Precision Oncology by Extending the Bioinformatics Predictions with Advanced Computing and Machine Learning. 2023