2018 YPIC Challenge: A case study in characterizing an unknown protein sample
For the 2018 YPIC Challenge, contestants were invited to try to decipher two unknown English questions encoded by a synthetic protein expressed in Escherichia coli. In addition to deciphering the sentence, contestants were asked to determine the three-dimensional structure and detect any post-translation modifications left by the host organism. We present our experimental and computational strategy to characterize this sample by identifying the unknown protein sequence and detecting the presence of post-translational modifications. The sample was acquired with dynamic exclusion disabled to increase the signal-to-noise ratio of the measured molecules, after which spectral clustering was used to generate high-quality consensus spectra. De novo spectrum identification was used to determine the synthetic protein sequence, and any post-translational modifications introduced by E. coli on the synthetic protein were analyzed via spectral networking. This workflow resulted in a de novo sequence coverage of 70%, on par with sequence database searching performance. Additionally, the spectral networking analysis indicated that no systematic modifications were introduced on the synthetic protein by E. coli. The strategy presented here can be directly used to analyze samples for which no protein sequence information is available or when the identity of the sample is unknown. All software and code to perform the bioinformatics analysis is available as open source, and self-contained Jupyter notebooks are provided to fully recreate the analysis.
This document is the unedited Author's version of a Submitted Work that was subsequently accepted for publication in the Journal of Proteome Research, copyright © American Chemical Society after peer review. To access the final edited and published work see https://pubs.acs.org/articlesonrequest/AOR-gV4SwwddkH9NagGMX6bP.