Published May 11, 2018 | Version v1
Project deliverable Open

Report on prototypes constructed using Big Data approaches.

  • 1. STFC
  • 2. EMBL-EBI

Description

We identified two prototyping projects that are useful in the structural biology domain and also act as prototypes for future work, each demonstrating a particular Big Data technology and providing some initial useful functionality. The first applies Natural Language Processing (NLP) methods to the structural biology literature to identify information that is not currently incorporated into databases: structural annotations on specific residues within proteins. This pilot project produced pyresid, an open source software package written in the Python programming language. Using it, annotations will be made to all past papers with known links to entries in the Protein Data Bank; these annotations are available in EuropePMC. The second prototype explores the use of Convolutional Neural Networks to distinguish protein from noise in cryoEM maps. Python scripts are made available for generating machine learning input data from structural biology data, and for creating and training a model. While this serves as a useful prototype, further cleaning of the input data and better training of the model are still required.
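To give a feel for what identifying residue mentions in literature text involves, here is a minimal sketch in Python. It is not pyresid's API or method. The names `THREE_LETTER`, `RESIDUE_PATTERN` and `find_residue_mentions` are hypothetical, and the approach (a single regular expression over three-letter amino acid codes followed by a position number) is an assumption chosen for illustration only.

```python
import re

# Standard three-letter codes for the 20 common amino acids.
THREE_LETTER = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]

# Hypothetical pattern: a three-letter code, an optional space or hyphen,
# then a 1-4 digit sequence position (e.g. "Ser195", "His 57", "Asp-102").
RESIDUE_PATTERN = re.compile(
    r"\b(" + "|".join(THREE_LETTER) + r")[\s-]?(\d{1,4})\b"
)


def find_residue_mentions(text):
    """Return (amino_acid, position) pairs in order of appearance."""
    return [
        (match.group(1), int(match.group(2)))
        for match in RESIDUE_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sentence = "The catalytic triad comprises Ser195, His 57 and Asp-102."
    for amino_acid, position in find_residue_mentions(sentence):
        print(amino_acid, position)
```

A pattern this simple misses one-letter forms such as "H57" and full names such as "histidine 57", and it can produce false positives (for example "Met 5" in an unrelated context), which is one reason a dedicated NLP pipeline is needed for literature-scale annotation.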

Files

West-Life_D7_8.pdf (603.0 kB)
md5:615307482a32a5dccd3705efa6c7aa70

Additional details

Funding

West-Life – World-wide E-infrastructure for structural biology (grant no. 675858)
European Commission