Project deliverable Open Access
Morris, Chris; Firth, Rob; Winn, Martyn; Talo, Francesco
We have identified two prototyping projects that are useful in the structural biology domain. Each demonstrates a particular Big Data technology, provides some initial useful functionality, and acts as a prototype for future work. The first applies Natural Language Processing (NLP) methods to the structural biology literature to identify information that is not currently incorporated into databases: structural annotations on specific residues within proteins. The software created by this pilot project is pyresid, an open source package written in the Python programming language. Using it, annotations will be made to all past papers with known links to entries in the Protein Data Bank. These annotations are available in EuropePMC. For the second prototype, we looked at the use of Convolutional Neural Networks for distinguishing between protein and noise in cryoEM maps. Python scripts for generating machine learning input data from structural biology data, and for creating and training a model, are made available. While this serves as a useful prototype, further cleaning of the input data and better training of the model are still required.
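To illustrate what a residue-level annotation in the first prototype might look like, the following is a minimal sketch of pattern-based residue mention extraction. It is not the pyresid API or the method used in the project; the names `THREE_LETTER`, `RESIDUE_PATTERN` and `find_residue_mentions` and the example sentence are hypothetical, and only capitalised three-letter amino acid codes followed by a position number are assumed to count as mentions.

```python
import re

# Three-letter codes for the 20 standard amino acids (assumed mention vocabulary).
THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

# Matches forms such as "Ser195", "His-57" and "Asp 102".
# Case-sensitive on purpose, so that "his 3" in ordinary prose is not matched.
RESIDUE_PATTERN = re.compile(rf"\b({THREE_LETTER})[- ]?(\d+)\b")


def find_residue_mentions(text):
    """Return (amino_acid, position, start, end) for each mention in text."""
    mentions = []
    for match in RESIDUE_PATTERN.finditer(text):
        mentions.append(
            (match.group(1), int(match.group(2)), match.start(), match.end())
        )
    return mentions


# Hypothetical example sentence, for illustration only.
sentence = "The catalytic triad comprises Ser195, His-57 and Asp 102."
mentions = find_residue_mentions(sentence)
for amino_acid, position, start, end in mentions:
    print(amino_acid, position, sentence[start:end])
```

The character offsets returned with each mention are what would allow an annotation to be anchored to a specific span of the paper text. A real system would also need to handle one-letter codes, full residue names and mutation notation, which this sketch ignores.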