Sensitivity Datasets - Leveraging Implicit Knowledge in Neural Networks for Functional Dissection and Engineering of Proteins

Upmeier zu Belzen, Julius; Buergel, Thore; Holderbach, Stefan; Bubeck, Felix; Lehmann, Irina; Niopek, Dominik*; Eils, Roland*; iGEM Team Heidelberg 2017

doi:10.5281/zenodo.1402817

Published August 24, 2018 | Version v1

Dataset Open

1. Synthetic Biology Group, Institute for Pharmacy and Biotechnology (IPMB) and Center for Quantitative Analysis of Molecular and Cellular Biosystems (BioQuant), University of Heidelberg, Heidelberg, 69120, Germany; Digital Health Center, Berlin Institute of Health (BIH) and Charité University Medicine, Berlin, 10117, Germany
2. Synthetic Biology Group, Institute for Pharmacy and Biotechnology (IPMB) and Center for Quantitative Analysis of Molecular and Cellular Biosystems (BioQuant), University of Heidelberg, Heidelberg, 69120, Germany
3. Molecular Epidemiology Unit, Berlin Institute of Health (BIH) and Charité University Medicine, Berlin, 10117, Germany
4. Digital Health Center, Berlin Institute of Health (BIH) and Charité University Medicine, Berlin, 10117, Germany; Health Data Science Unit, University Hospital Heidelberg, Heidelberg, 69120, Germany

Leveraging Implicit Knowledge in Neural Networks for Functional Dissection and Engineering of Proteins

The Sensitivity datasets cover more than 2000 proteins and are structured as follows.

They are uploaded as a single tar.gz archive containing three directories and one CSV file.

mean_pdb/ contains the proteins used to analyze the sphere variances, the correlation with information content, the correlations between GO terms (Figure 2c-e) and the ligand binding (Figure 3a-c, Supplementary Figure 3).
mean_examples/ contains the proteins used for inferring the protein-receptor hybrids by the Hahn lab (Figure 5).
with_biological_activity/ contains the ERK2 data (Figure 3), the spCas9 data (Supplementary Figure 5) and the AcrIIA4 data (Figure 6).
binding_activities.csv contains the PDB identifiers and ligand descriptions for Figure 3 and Supplementary Figure 3 and is needed for distance_to_ligand.py.
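
To check the layout described above before unpacking everything, the archive can be inspected in place. The following is a minimal sketch using only the Python standard library; the function name count_top_level_entries is hypothetical, and the code assumes the directories sit at the top level of the archive (if they are wrapped in an extra folder, that wrapper is reported instead).

```python
import tarfile
from collections import Counter


def count_top_level_entries(archive_path):
    """Count regular files per top-level entry of a .tar.gz archive."""
    counts = Counter()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Drop a leading "./" and any empty path components.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            # Assumption: the first component is one of the directories
            # (or the CSV file) named in the description.
            counts[parts[0]] += 1
    return counts


# Usage (path is an assumption about where the download was saved):
#   for name, n in sorted(count_top_level_entries("data.tar.gz").items()):
#       print(f"{name}\t{n}")
```

Iterating over the archive avoids extracting files to disk, which is convenient for a first look at which directory holds how many files.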

Important

The biological activity data for ERK2 is from Brenan et al.¹: Supplementary Table 1. We used column ‘dox_average’, here 'mean_dox_average'.
The biological activity data for spCas9 is from Oakes et al.²: Supplementary Table 2. We used column ‘fold_change’ log2-transformed, here 'mean_log2_fold_change'.
The sequences and secondary structure information were downloaded from the RCSB Protein Data Bank and are available at https://cdn.rcsb.org/etl/kabschSander/ss_dis.txt.gz (this URL is listed, with some explanation, at http://www.rcsb.org/pdb/static.do?p=download/http/index.html).
The secondary structure annotation relies on the DSSP algorithm by Kabsch and Sander³.
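
The log2 transformation mentioned for the ‘fold_change’ column maps enrichment and depletion symmetrically around zero: a fold change of 2 becomes 1, and a fold change of 0.5 becomes -1. A minimal sketch follows; the function name is hypothetical, and returning NaN for missing or non-positive values is an assumption, not something the dataset description specifies.

```python
import math


def log2_fold_change(fold_change):
    """Return log2 of a strictly positive fold change, NaN otherwise."""
    # Assumption: missing or non-positive inputs are mapped to NaN,
    # because the logarithm is undefined for them.
    if fold_change is None or fold_change <= 0:
        return float("nan")
    return math.log2(fold_change)
```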

The files are tab-separated and contain the following columns:

Pos	Position in the sequence, starting from zero
AA	Amino acid at that position
sec	Secondary structure as annotated in the RCSB Protein Data Bank
dis	Disorder annotation: marks regions that have not been experimentally observed (this sometimes explains mismatches with crystal structures)
GO:_______	Sensitivity for that GO term
svar_GO:_______	Sphere variance of the sensitivity for that GO term
ic	Information content, based on the Pfam seed alignment
svar_n_neighbours	Number of residues in the sphere used to calculate the sphere variance
svar_d_center	Distance to the center of mass of the chain that was analyzed
Other columns	Refer to biological activity data and depend on the source
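
The column layout above can be read with Python's standard csv module. The sketch below is illustrative only: the function names are hypothetical, it assumes that the first line of each file is a header row carrying the column names listed above, and it assumes that the GO columns hold values that float() can parse. The demo values are synthetic and are not taken from the dataset.

```python
import csv
import io
import math


def read_sensitivity_table(handle):
    """Parse a tab-separated sensitivity table into a list of row dicts.

    "Pos" is converted to int; columns starting with "GO:" or "svar_GO:"
    are converted to float. All other columns stay as strings.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    numeric = [name for name in (reader.fieldnames or [])
               if name.startswith(("GO:", "svar_GO:"))]
    rows = []
    for record in reader:
        row = dict(record)
        row["Pos"] = int(row["Pos"])
        for name in numeric:
            row[name] = float(row[name])
        rows.append(row)
    return rows


def top_positions(rows, column, n=3):
    """Return (Pos, AA) for the n largest absolute values in `column`."""
    finite = [r for r in rows if not math.isnan(r[column])]
    ranked = sorted(finite, key=lambda r: abs(r[column]), reverse=True)
    return [(r["Pos"], r["AA"]) for r in ranked[:n]]


if __name__ == "__main__":
    # Synthetic three-residue table, for illustration only.
    demo = ("Pos\tAA\tsec\tdis\tGO:0005524\n"
            "0\tM\t-\tX\t0.1\n"
            "1\tK\tH\t-\t-0.7\n"
            "2\tL\tH\t-\t0.3\n")
    table = read_sensitivity_table(io.StringIO(demo))
    print(top_positions(table, "GO:0005524", n=2))
```

For a real file, pass an open handle instead of the StringIO object, for example open(path, newline=""), where path points to one of the extracted files.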

References

1. Brenan, L. et al. Phenotypic Characterization of a Comprehensive Set of MAPK1/ERK2 Missense Mutants. Cell Rep 17, 1171-1183, doi:10.1016/j.celrep.2016.09.061 (2016).
2. Oakes, B. L. et al. Profiling of engineering hotspots identifies an allosteric CRISPR-Cas9 switch. Nat Biotechnol 34, 646-651, doi:10.1038/nbt.3528 (2016).
3. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637, doi:10.1002/bip.360221211 (1983).

Notes

Authorship Statement: The following are members of the iGEM (International Genetically Engineered Machine) Team Heidelberg 2017: Lukas Adam, Thore Bürgel, Roland Eils, Catharina Gandor, Daniel Heid, Mareike Daniela Hoffmann, Stefan Holderbach, Michael Jendrusch, Marita Klein, Irina Lehmann, Jan Mathony, Dominik Niopek, Pauline Pfuderer, Lukas Platz, Moritz Przybilla, Carolin Schmelas, Max Schwendemann, Julius Upmeier zu Belzen, Max Waldhauer (all from Germany).

Acknowledgements: This work was funded by the Klaus Tschira Foundation, the German Research Council (DFG) and the Federal Ministry of Education and Research (BMBF). We thank Jürgen Quittek and Matthias Niepert (both NEC, Heidelberg) and Thomas Wollmann (IPMB, BioQuant and German Cancer Research Center (DKFZ), Heidelberg) for helpful discussions, and Marc Hemberger (BioQuant, Heidelberg) for support with IT and GPU cluster use.

Files

Name	Size
data.tar.gz (md5:574f7efa1ae2eb2c363b68cb041e2b93)	110.5 MB
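
The MD5 checksum in the file listing can be used to confirm that a download is complete. The following is a minimal sketch with Python's hashlib; the function names and the chunk size are illustrative choices, and the local file path is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as given in the file listing for data.tar.gz.
EXPECTED_MD5 = "574f7efa1ae2eb2c363b68cb041e2b93"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole archive is never in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the listed checksum."""
    return md5_of_file(path) == expected.lower()


# Usage (path is an assumption):
#   print(matches_listing("data.tar.gz"))
```

MD5 is adequate here for detecting truncated or corrupted downloads; it is not a guarantee against deliberate tampering.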

Additional details

Is compiled by: https://github.com/juzb/DeeProtein (URL); 10.5281/zenodo.1402828 (DOI)
