Protein Structure Annotations

This chapter introduces the specifics of protein structure annotations and their fundamental position in Structural Bioinformatics, and in Bioinformatics in general. Proteins are profoundly characterized by their structure in every aspect of their functioning and, while over the last decades there has been close to exponential growth in the number of known protein sequences, the growth in known protein structures has been closer to linear because of the high complexity and cost of determining them. Thus, protein structure predictors are among the most thoroughly assessed tools in Bioinformatics (in venues such as CASP or CAMEO) because they allow the structural study of proteins on a large scale. This chapter presents the key types of protein structure annotation and the methods and algorithms for predicting them, with the aim of giving both a historical perspective on their development and a snapshot of the current state of the art. From one-dimensional protein annotations (i.e., secondary structure, solvent accessibility and torsion angles) to more complex and informative two-dimensional protein abstractions (i.e., contact maps), both mature and currently developing methods for protein structure annotation are introduced. The aim of this overview is to facilitate the adoption and development of state-of-the-art protein structure predictors. Particular attention is given to some of the best performing and freely available web servers and standalone programs to predict protein structure annotations.


Protein Structure Annotations
Proteins hold a unique position in Structural Bioinformatics. In fact, more so than other biological macromolecules such as DNA or RNA, their structure is directly and profoundly linked to their function. Their cavities, protuberances, and overall shapes determine with what and how they will interact and, therefore, the roles they assume in the hosting organism. Unfortunately, the complexity, wide variability, and ultimately the sheer number of diverse structures present in nature make the characterization of proteins extremely expensive and complex. For this reason, considerable effort has been spent on predicting protein structures by computational means, either directly or in the form of abstractions that simplify the prediction while still retaining structural information. These abstractions, or protein structure annotations, may be one-dimensional when they can be represented by a string or a sequence of numbers, typically of the same length as the protein's primary structure (the sequence of its amino acids). This is the case, for instance, of secondary structure (SS) or solvent accessibility (SA). Another important class of abstractions is composed of two-dimensional properties, that is, features of pairs of amino acids (AA) or SS, such as contact and distance maps, disulphide bonds, or pairings of strands into β-sheets.
Machine Learning (ML) techniques have been extensively used in Bioinformatics, and in Structural Bioinformatics in particular. The abundance of freely available data, such as the Protein Data Bank (PDB) [1] and the Universal Protein Resource [2], and their complexity make proteins an ideal domain in which to apply the most recent and sophisticated ML techniques, such as Deep Learning [3]. Nonetheless, there are pitfalls to avoid and best practices to follow to correctly train and test any ML method on protein sequences [4].
Deep Learning is a collection of methods and techniques to efficiently train nuanced parametric models such as Neural Networks (NN) with multiple hidden layers [5]. These layers contain hierarchical representations of the features of interest extracted from the input. NN are the de facto standard ML method to predict protein structure annotations. They have a central role at the two most important academic assessments of protein structure predictors: CASP and CAMEO [6]. Thus, they are widely used to predict one-dimensional and two-dimensional protein structural abstractions.
A typical predictor of protein structure annotations first looks for evolutionary information (PSI-BLAST is commonly used for this task), then encodes the information found, then runs an ML method (usually a NN) on the encoded information, and finally processes the output into a human-readable format. Unlike ab initio methods, template-based predictors directly exploit structural information of resolved proteins alongside evolutionary information [7].
Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) [8] is the de facto standard algorithm, released with the BLAST+ suite, to address protein alignment. In particular, it is commonly used in place of BLAST whenever remote homologues are relevant. PSI-BLAST executes a BLAST call to find similar proteins in a given database, then either uses the resulting Multiple Sequence Alignment (MSA) to construct a Position Specific Score Matrix (PSSM), or outputs the MSA itself. The entire process is usually iterated a few times, using the last PSSM as the query for the next iteration, in order to improve the PSSM and, thus, maximize the sensitivity of the method. The trade-off for increasing the number of iterations, and with it the sensitivity of the method, is a higher likelihood of corrupting the PSSM by including false positive hits in it [9]. For this reason, and because of the very nature of the tool, it is fundamental to consider PSI-BLAST as a predictive tool and not as an exact algorithm [10].
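To make the notion of a PSSM more concrete, the following is a minimal sketch of how per-column log-odds scores could be derived from an MSA. It is not the PSI-BLAST procedure: it assumes a uniform background frequency, a fixed pseudocount and no sequence weighting, and the function name `toy_pssm` is purely illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(msa, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column.

    Simplifying assumptions (not PSI-BLAST's actual scheme): uniform
    background of 1/20, a fixed pseudocount, no sequence weighting.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count only the 20 standard residues; gaps and other symbols are skipped.
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


if __name__ == "__main__":
    profile = toy_pssm(["ACD", "ACE", "AC-"])
    print(round(profile[0]["A"], 2), round(profile[0]["C"], 2))
```

Residues that are conserved in a column receive positive scores, while residues never observed there receive negative ones; a matrix of this kind, one row per position, is what is typically passed to a downstream predictor.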
HHblits [11] is a 2011 algorithm to address protein alignment. It focuses on fast iterations, and high precision and recall. It obtains these gains by adopting Hidden Markov Models (HMM) to represent both query and database sequences. The overall approach resembles that of PSI-BLAST, except that HMM rather than PSSM are the central entity. In fact, the heuristic algorithm first looks for similar proteins in the HMM database. Then, it either uses the resulting HMM to improve the query HMM and iterates, or outputs the MSA found with the last HMM. The same trade-off between the number of iterations and the likelihood of corrupting the profile holds for HHblits as it does for PSI-BLAST.
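As a hedged illustration of how the two tools are typically invoked, the sketch below builds command lines for `psiblast` (BLAST+) and `hhblits` (HH-suite). The flags are the ones we understand these programs to accept and should be checked against the installed versions; the file names, database names and the helper functions themselves are hypothetical.

```python
def psiblast_command(query, database, iterations=3,
                     pssm_out="query.pssm", threads=1):
    """Build a PSI-BLAST call that writes the final PSSM in ASCII format."""
    return [
        "psiblast",
        "-query", query,
        "-db", database,
        "-num_iterations", str(iterations),  # more iterations: more sensitive, riskier
        "-out_ascii_pssm", pssm_out,
        "-num_threads", str(threads),
    ]


def hhblits_command(query, database, iterations=3,
                    a3m_out="query.a3m", hhm_out="query.hhm", cpus=1):
    """Build an HHblits call that writes both the MSA (A3M) and the HMM."""
    return [
        "hhblits",
        "-i", query,
        "-d", database,
        "-n", str(iterations),
        "-oa3m", a3m_out,
        "-ohhm", hhm_out,
        "-cpu", str(cpus),
    ]


if __name__ == "__main__":
    # Hypothetical paths; print the commands instead of running them.
    print(" ".join(psiblast_command("query.fasta", "uniref90")))
    print(" ".join(hhblits_command("query.fasta", "uniclust30")))
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a machine where the tools and databases are installed. Keeping the number of iterations as an explicit parameter makes the sensitivity versus corruption trade-off described above easy to explore.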
In this chapter we review the main abstractions of protein structures, namely SS, SA, torsional angles (TA) and distance/contact maps. For each of them we describe an array of ML algorithms that have been used for their characterisation, point to a set of public tools available to the research community, including some that have been developed in our laboratory, and try to outline the state of the art in their prediction. These structure annotations are complementary to one another as they look at proteins from different viewpoints. That said, some annotations have received far more interest from the bioinformatics community than others, for reasons such as simplicity or the intrinsic nature of the feature itself. We focus more on these well-assessed annotations, keeping in mind that the main function of protein structure annotations is to facilitate the understanding of the very core of any protein: the three-dimensional structure.
The PSSM built by PSI-BLAST, the HMM built by HHblits, or the encoded MSA built by either tool are generally used as inputs to a protein feature predictor. Different releases of the database used to find evolutionary information may lead to different outcomes. Normally, a computer able to look for evolutionary information (thus, to execute PSI-BLAST or HHblits calls successfully) has adequate hardware to run the standalones presented here without problems.
All the predictors described below offer a web server, are free for academic use and provide licenses for commercial users at the time of writing. The web servers described return a prediction in anything between a few minutes and a few hours.

Secondary Structure
SS prediction is one of the great historical challenges in Bioinformatics [12], [13]. Its history started in 1951, when Pauling and Corey predicted for the first time the existence of what were later discovered to be the two most common SS conformations: α-helix and β-sheet [14]. Notably, the very first high-resolution protein structure was determined only in 1958 (and led to a Nobel Prize for Kendrew and Perutz) [15], [16]. These early successes motivated the first generation of SS predictors, which were able to extrapolate statistical propensities of single AA (or residues) towards certain conformations [17]. The slow but steady growth of available data and further insights into protein structure led to the second generation of predictors, which expanded the input to segments of adjacent residues (3-51 AA) to gather more useful information, and assessed many available theoretical algorithms on SS [12]. In the 90s, more available computational power and data allowed the development and implementation of more advanced algorithms, able to look for and take advantage of evolutionary information [13]. Thus, the third generation of SS predictors was the first able to predict at better than 70% accuracy [18], efficiently exploit PSI-BLAST [19] and implement deep NN [20]. In 2002, SS was removed from CASP since the few and relatively short targets assessed at the venue were not considered statistically sufficient to evaluate the mature methods available [21].
The intrinsic nature of SS, being an intermediate structural representation between primary and tertiary structure, makes it a strategic and fundamental one-dimensional protein feature. It is often adopted as an intermediate step towards more complex and informative features (e.g., contact maps [22]-[24], the recognition of protein folds [25] and protein tertiary structure [26]). In other words, a high-quality SS prediction can greatly help to understand the nature of a protein and lead to a better prediction of its structure. For example, SS regularities characterize the proteins in a common fold [27].
The theoretical limit of SS prediction is usually set at 88-90% accuracy per AA [13]. This limit is mainly derived from the disagreement on how to assign SS and from the intrinsic dynamic nature of protein structure, i.e., the protein structure changes according to the fluid in which the protein is immersed. In particular, Define Secondary Structure of Proteins (DSSP) [28], the gold standard algorithm to assign SS given the atomic-resolution coordinates of the protein, agrees with the PDB descriptions around 90.8% of the time [29]. While DSSP aims to provide an unambiguous and physically meaningful assignment, the PDB represents the ground truth in structural proteomics [1].
All the SS predictors described in this chapter exploit different architectures of NN to perform their predictions. The list of AA composing the protein of interest is the only input required. SS is often classified in three states (i.e., helices, sheets and coils), although DSSP identifies a total of 8 different classes. Because of the higher difficulty of the task, compounded by the rare occurrence of certain classes (i.e., π-helix and β-bridge), only 3 of the predictors presented here (Porter5, RaptorX-Property and SSpro) can predict in both three states and eight states. The DSSP classification of SS in eight states is the following:
• G = 3-turn helix (3₁₀-helix), minimum length 3 residues;
• H = 4-turn helix (α-helix), minimum length 4 residues;
• I = 5-turn helix (π-helix), minimum length 5 residues;
• T = hydrogen bonded turn (3, 4 or 5 turn);
• E = extended strand (in β-sheet conformation), minimum length 2 residues;
• B = residue in isolated β-bridge (single pair formation);
• S = bend (the only non-hydrogen-bond based assignment);
• C = coil (anything not in the above conformations).
When SS is classified in three states, the first three classes (G, H, I) are generally considered helices, E and B are classified as strands and anything else as coils. SS prediction is evaluated by looking at the rate of correctly classified residues (i.e., Q3 or Q8 for three- or eight-state prediction, respectively) or, for a more biological viewpoint, at the segment overlap score (SOV), i.e., the overlap between the predicted and the real segments of SS [30]. The best performing ab initio SS predictors are able to predict three-state SS at close to 85% Q3 accuracy and SOV score.
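The reduction from eight to three states and the Q3/Q8 measure described above can be sketched in a few lines. This is an illustrative implementation of the mapping stated in the text (G, H, I to helix; E, B to strand; everything else to coil); the function names are our own.

```python
# Mapping of DSSP classes to three states; any class not listed becomes coil.
EIGHT_TO_THREE = {"G": "H", "H": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(dssp_string):
    """Collapse an eight-state DSSP string into helix (H), strand (E), coil (C)."""
    return "".join(EIGHT_TO_THREE.get(state, "C") for state in dssp_string)


def q_accuracy(predicted, observed):
    """Percentage of residues whose predicted class matches the observed one.

    The result is Q3 for three-state strings and Q8 for eight-state strings.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    observed = reduce_to_three_states("GHITEBSC")
    print(observed, q_accuracy("HHHCEECC", observed))
```

Note that the same function yields different numbers for Q3 and Q8 on the same prediction, since errors between classes that share a three-state label (e.g., G versus H) disappear after the reduction.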
The table below gathers name, web server and notes on special features of every SS predictor presented in this chapter. A standalone (i.e., a downloadable version that can run on a local machine) is currently available for all of them except Jpred4, which instead provides command line software to query its server.

Jpred4
The web server of Jpred4 is available at http://www.compbio.dundee.ac.uk/jpred4/. It requires a protein sequence in either FASTA or RAW format. Using the advanced options, it is also possible to submit multiple sequences (up to 200) or an MSA as files. An email address and a JobID can optionally be provided. When a single sequence is given, Jpred4 looks for similar protein sequences in the PDB [1] and lists them when found. By checking a box it is possible to skip this step and force an ab initio prediction. Jpred4 relies on a version of UniRef90 [2] released in July 2014, while the PDB is regularly updated.
The result page is automatically shown and offers a graphical summary of the prediction along with links to possible views of the result in HTML (simple or full), PDF and Jalview [38] (in-browser or not).
It is also possible to get an archive of all the files generated or navigate through them in the browser.
If an email address is submitted, a link to the result page and a summary containing the query, predicted SS and confidence per AA will be sent. The full result, made available as HTML or PDF, lists the ID of similar sequences used at prediction time, the final and intermediate predictions for SS, the prediction of coiled-coil regions, the prediction of SA with three different thresholds (0, 5 and 25% exposure) and the reliability of such predictions.
Jpred4 is not released as a standalone, but it is possible to submit, monitor and retrieve a prediction using the command line software available at http://www.compbio.dundee.ac.uk/jpred4/api.shtml.
A second package of scripts is made available at the same address to facilitate the submission, monitoring and retrieval of multiple protein sequences. More instructions and examples on how to use the command line software are presented on the same page.

PSIPRED
PSIPRED is a high quality SS predictor freely available since 1999 [19]. Its latest version (v4.01) was released in 2016. PSIPRED exploits the PSSM of the protein to generate its prediction by Neural Networks. Like SSpro (described below), it recommends using the legacy BLAST package (abandoned in 2011) to collect evolutionary information. The BLAST+ package (the active development of BLAST) fixes multiple bugs and provides improvements and new features, but scales the PSSM by 10 and rounds it, and thus provides less informative input for PSIPRED. PSIPRED supports BLAST+ only experimentally.
The web server of PSIPRED [39], called The PSIPRED Protein Sequence Analysis Workbench, runs a 2012 release of PSIPRED (v3.3) and can be found at http://bioinf.cs.ucl.ac.uk/psipred/. A single sequence (or its MSA) and a short identifier are expected as input. Optionally, an email address can be inserted to receive a confirmation email (with a link to the result) when the prediction is ready. Several prediction methods (for other protein features) can be chosen. The default choice (picking only PSIPRED) is sufficient to predict the SS. If the submission proceeds successfully, a courtesy page will be shown until the result is ready.
The result page, organised in tabs, shows the list of AA composing the analysed protein (the query sequence) and the predicted SS class (using different colours). From the same tab, it is possible to select the full query sequence, or a subsequence, to pass it to one of the prediction methods available on the PSIPRED Workbench. The predicted SS is presented as a diagram in the tab called PSIPRED. The same diagram includes the confidence of each prediction and the query sequence. The Downloads tab, the last one, allows the download of the information in the diagram as text, PDF or PostScript, or of all three versions.
The latest release of PSIPRED is typically available as a standalone at http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/. Once the standalone has been downloaded and extracted, it is sufficient to follow the instructions in the README to perform predictions on any machine. The output is generated in text format only, in either horizontal or vertical layout. The vertical layout also contains the individual confidence per helix, strand and coil. Notably, the results obtained from the standalone may well differ from those obtained from the PSIPRED Workbench, since the Workbench does not run the latest PSIPRED release at the time of writing.
In 2013, a preliminary package (v0.4) was released to run PSIPRED on Apache Hadoop. Hadoop is open-source software that facilitates distributed processing on computer clusters. Although this PSIPRED package is intended as an alpha build, instructions to install it on Hadoop and on AWS (the cloud service of Amazon) are provided. This package does not contain any standalone of PSIPRED; rather, it is an interface to run the selected PSIPRED release on Hadoop. It can be downloaded at http://bioinfadmin.cs.ucl.ac.uk/downloads/hadoop/.

Porter
Porter is a high quality SS predictor which has been developed starting in 2005 [32] and improved since then [7], [40]. Porter is built on carefully tuned and trained ensembles of cascaded Bidirectional Recurrent Neural Networks [20]. It is typically trained on very large datasets, which are released as well. Its latest release (v5) is available as a web server and a standalone [41]. Unlike PSIPRED, it uses BLAST+ to gather evolutionary information. To maximise the gain obtained from evolutionary information, it also adopts HHblits alongside PSI-BLAST. Porter5 is one of the three SS predictors presented here that are able to predict both three-state and eight-state SS.
The web server can be found at http://distilldeep.ucd.ie/porter/. The basic interface asks for protein sequences (in FASTA format) and for an optional email address. Up to 64 KB of protein sequences can be submitted at the same time, which approximately corresponds to 200 average proteins. Unlike other SS web servers, there is no limit on the total number of submissions. The confirmation page contains a summary of the job, the server load (how many jobs are to be processed) and the URL to the result page. It is automatically refreshed every minute. The detailed result page shows the query, the SS prediction and the individual confidence. In other words, the same information shown by the PSIPRED Workbench is given in text format. The time taken to serve the job is shown as well. Optionally, if an email address has been inserted, all the information in the result page is sent by email. Thus, it can potentially be retrieved at any time. It is possible to predict SS and other protein structure annotations (one-dimensional or not) by submitting one job at http://distillf.ucd.ie/distill/.
The very light standalone of Porter5 (7 MB) is available at http://distilldeep.ucd.ie/porter/. It is sufficient to extract the archive on any computer with Python3, HHblits and PSI-BLAST to start predicting SS. Using the parameter --fast, it is possible to avoid PSI-BLAST and perform faster but generally slightly less accurate predictions. When the prediction in three states and eight states completes successfully, it is saved in 2 different files. Each file shows the query, the predicted SS and the individual confidence per class. The datasets adopted for training and testing purposes are available at the same address.

RaptorX-Property
RaptorX-Property, released in 2016, is a collection of methods to predict one-dimensional protein annotations [33]. Namely, SS, SA and disorder regions are predicted from the same suite. The SS is predicted in both three states and eight states, as with Porter5 and SSpro. At the cost of lower accuracy, evolutionary information can be skipped to perform faster predictions. Its latest release substitutes PSI-BLAST with HHblits to obtain protein profiles faster.
The web server of RaptorX-Property is available at http://raptorx.uchicago.edu/StructurePropertyPred/predict/. A job name and an email address are recommended but not required. Query sequences can be uploaded directly from one's machine. Otherwise, up to 100 protein sequences (in FASTA format) can be passed at the same time through the input form. The system allows up to 500 pending (sequence) predictions at any time. The current server load, shown in the sidebar, indicates how many pending jobs remain to be completed.
Once the job has been submitted, a courtesy page provides the URL to the result page, the number of pending jobs ahead and the JobID. Lower priority is given to intensive users. The jobs submitted in the previous 60 days are retrievable by clicking on "My Jobs". Once the prediction is performed, the result page shows a summary of it using coloured text. At the bottom of the page, the same information is organised in tabs, one tab per feature predicted (SS in 3 and 8 states, SA and disorder). The individual confidence is provided in the tabs. All this information is sent by email (in txt and rtf format) if an email address has been provided. Otherwise, it can be downloaded by clicking the specific button.

SPIDER3
SPIDER3 is the second version of a recent SS predictor first released in 2015 [42]. Its latest release is composed of 2 NN, the first of which predicts SS while the second predicts backbone angles, contact numbers and SA [34]. It internally represents each AA using 7 representative physico-chemical properties [43]. Like Porter5, it uses both HHblits and PSI-BLAST to look for more evolutionary information. SPIDER3 is also described in sections Solvent Accessibility and Torsion Angles.
The web server of SPIDER3 is available at http://sparks-lab.org/server/SPIDER3/. An email address is required when multiple sequences are submitted, or to receive a summary of the prediction. Otherwise, the query sequence is sufficient to submit the job and obtain a URL to the result page. The web server accepts up to 100 protein sequences (in FASTA format) at a time and an optional JobID. To prevent duplicates, it is possible to view the queue of jobs submitted from one's IP address. The result page presents the query sequence and the predicted SS and SA in a simple, colour-coded text format. On the same page, it is possible to download a summary (containing the same information) or an archive with the 4 features predicted and the individual confidence for SS. There is also a link to a temporary directory containing all the files created during the prediction, including the HMM and the PSSM.
The standalone of SPIDER3, and the dataset used to train and test it, can be downloaded at http://sparks-lab.org/server/SPIDER3/. The main prerequisite is to install a Python library of choice between Numpy and Tensorflow r0.11 (an older version). As for Porter5, it is then sufficient to install HHblits and PSI-BLAST to perform SS prediction on one's machine. The outcome of SS, SA, torsion angle and contact number prediction will be saved in different columns of a single file. The storage required is 101 MB and 117 MB, respectively, without considering the library of choice.

SSpro
SSpro is a historical SS predictor developed starting in 1999 [20], [35]. Similarly to PSIPRED, it uses the legacy BLAST package rather than the more recent BLAST+. The latest version of SSpro (v5) was released in 2014, together with ACCpro (see Solvent Accessibility, ACCpro), and performs template-based SS predictions [35]. More specifically, it exploits PSI-BLAST to look for homologues at both sequence and structure level [7]. In other words, SSpro v5 has an additional final step in which it looks for similar proteins in the PDB.
SSpro is available at http://scratch.proteomics.ics.uci.edu/ as part of the SCRATCH protein predictor [44]. SS is among the several (one-dimensional or not) protein features predictable on SCRATCH. As with Porter5 and RaptorX-Property, both three-state (SSpro) and eight-state (SSpro8) predictions are possible. Once SSpro or SSpro8 is selected, an email address is required and a JobID is optional. Only one protein (of up to 1500 residues) can be submitted at a time. There are 5 total slots in the job queue per user. Once ready, the result of the prediction will be sent by email only. It will contain the JobID, the query sequence, the predicted SS (in three or eight classes) and a link to the explanation of the output format.
The standalone of the latest SSpro (v5.2) and ACCpro (described in section Solvent Accessibility) compose the SCRATCH suite of 1D predictors available at http://download.igb.uci.edu/. SCRATCH v1.1 is released with all the prerequisites to set up and run SSpro. The BLAST package and the databases with both sequence and structural information are included. Thus, the amount of disk space needed to download and extract SCRATCH v1.1 is considerable (5.7 GB, or 97 MB without databases).

Solvent Accessibility
SA describes the degree of accessibility of a residue to the solvent surrounding the protein. SA is second only to SS among extensively studied and predicted one-dimensional protein structure annotations. The effort invested in SA predictors has been significant since the early 90s and was strongly motivated by the successes obtained in developing the third generation of SS predictors [45]. In fact, similarly to SS prediction but sometimes with some delay, mathematical and statistical methods [46], NN [47], evolutionary information [48] and deep NN [49] have been increasingly put to work to predict SA.
Although SA is less conserved than SS in homologous sequences [47], it is typically adopted in parallel with SS in many pipelines towards more complex protein structure annotations such as contact maps (CM), protein fold recognition [25] and protein tertiary structure [52]. For example, SA and SS are predicted by every CM predictor described in section Contact Maps [22], [23], [50], [51]. Notably, a strong (negative) correlation of -0.734 between SA and contact numbers has been observed by Yuan [53] and is motivating the development of predictors for contact number as a possible alternative to SA predictors [54].
NN predictors that consider adjacent AA have shown promising results on SA since the 90s [48], and different methods such as linear regression [55] or substitution matrices [45] have also been assessed, but the state of the art has been represented by deep NN since 2002 [49]. Thus, all the SA predictors described below (and summarised in the table) implement deep NN [33]-[35], [40], predicting SA as anything from a two-state problem (i.e., buried and exposed, with an average two-state accuracy greater than 80%) to twenty states.
SA has typically been measured as accessible surface area (ASA), i.e., the protein's surface exposed to interactions with the external solvent. The relative SA of a residue is usually obtained by normalizing the ASA value observed, according to DSSP [28], by the maximum possible value of accessibility for the specific residue type. The ASA of a protein can be visualized with ASAview, a tool developed in 2004 that requires real values extracted from the PDB or coming from predicted ASA [56]. More recently, a different approach to measuring SA, called half-sphere exposure (HSE), has been designed by Hamelryck [57]. The idea is to split the sphere surrounding the Cα atom in half along the Cα-Cβ vector, aiming to provide a more informative and robust measure [57]. SPIDER3 can predict both HSE and ASA using real numbers [34].
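The normalization step mentioned above can be sketched as follows. The maximum accessibility table is passed in as a parameter because different reference scales exist; the three values in `EXAMPLE_MAX_ASA` are illustrative placeholders in Å² and should be replaced by the table used in one's own pipeline. The function name is likewise our own.

```python
# Illustrative maxima in square angstroms; NOT a complete or authoritative table.
EXAMPLE_MAX_ASA = {"A": 106.0, "G": 84.0, "W": 227.0}


def relative_accessibility(residue, asa, max_asa):
    """Return relative solvent accessibility as a percentage in [0, 100].

    `asa` is the observed accessible surface area of the residue and
    `max_asa` maps one-letter residue codes to their maximum accessibility.
    Values above the tabulated maximum are clipped to 100%.
    """
    if residue not in max_asa:
        raise KeyError(f"no maximum accessibility for residue {residue!r}")
    if asa < 0:
        raise ValueError("accessible surface area cannot be negative")
    return min(100.0, 100.0 * asa / max_asa[residue])


if __name__ == "__main__":
    print(relative_accessibility("A", 53.0, EXAMPLE_MAX_ASA))
```

Clipping is a design choice made here because observed ASA can occasionally exceed a tabulated maximum, for instance at chain termini; other pipelines may prefer to keep the raw ratio.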

ACCpro
ACCpro is a historical SA predictor initially released in 2002 [49]. Since then, it has been developed in parallel with SSpro (see Secondary Structure, SSpro) and was last updated to v5 in 2014, adding support for template-based predictions [35]. Thus, like SSpro, ACCpro adopts the legacy BLAST to look for evolutionary information at both sequence and structure level. ACCpro predicts whether each residue is more exposed than 25% or not, while ACCpro20, an extension of ACCpro, distinguishes twenty states from 0-95% with incremental steps of 5%, i.e., ACCpro20 predicts twenty classes, from 0-5% to 95-100% of SA.
The web server of ACCpro and ACCpro20 is available at http://scratch.proteomics.ics.uci.edu/ as part of SCRATCH [44]. Once an email address and the sequence to predict have been inserted, it is possible to select ACCpro or ACCpro20 or any of the available protein predictors. More details are given in Secondary Structure, SSpro. The standalone of ACCpro was updated in 2015 and is available at http://download.igb.uci.edu/ as part of SCRATCH-1D v1.1. As described above (in Secondary Structure, SSpro), all the requirements are delivered together with the bundled predictors, i.e., ACCpro, ACCpro20, SSpro and SSpro8.
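The two discretizations described for ACCpro and ACCpro20 can be illustrated with a short sketch that turns a relative SA percentage into a two-state or a twenty-state label. The state letters and the handling of values falling exactly on a boundary are assumptions made for illustration, not the documented behaviour of either tool.

```python
def two_state(relative_sa, threshold=25.0):
    """Return 'e' (exposed) if relative SA exceeds the threshold, else 'b' (buried).

    The labels 'e' and 'b' are hypothetical; the 25% threshold follows the text.
    """
    return "e" if relative_sa > threshold else "b"


def twenty_state(relative_sa, step=5.0):
    """Return a class index from 0 (0-5%) to 19 (95-100%).

    Assumes each class includes its lower bound; 100% is folded into class 19.
    """
    if not 0.0 <= relative_sa <= 100.0:
        raise ValueError("relative SA must be a percentage in [0, 100]")
    return min(int(relative_sa // step), 19)


if __name__ == "__main__":
    for value in (3.0, 25.0, 42.5, 100.0):
        print(value, two_state(value), twenty_state(value))
```

The same pattern extends to other schemes by changing the thresholds, for example a three-state scheme with two cut-offs instead of one.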

PaleAle
The web server of PaleAle is available at http://distilldeep.ucd.ie/paleale/. As for Porter and Porter+ (see respective sections), the protein sequence is the only requirement while an email address is optional. More information about these servers is available in the Secondary Structure, Porter subsection.

RaptorX-Property
RaptorX-Property, described in section Secondary Structure, is a 2016 suite of predictors able to predict SA, SS and disorder regions [33]. RaptorX-Property predicts SA in three states with thresholds at 10% and 40%. As for SS predictions, RaptorX-Property can skip the search for evolutionary information to speed up predictions at the cost of lower accuracy. It relies on HHblits [11] to gather evolutionary information.
The web server of RaptorX-Property is available at http://raptorx.uchicago.edu/StructurePropertyPred/predict/. The result page of RaptorX-Property provides the predicted 1D annotations in different tabs (Figure 9 shows the three-state SA). The web server and the released standalone are described in section Secondary Structure, RaptorX-Property.

SPIDER3
SPIDER has been able to predict SA, SS and TA since 2015 [42] and was updated in 2017 [34]. SPIDER3, described also in sections Secondary Structure and Torsion Angles, predicts the ASA using real numbers rather than classes, unlike the other predictors presented here [42]. SPIDER2 was the first HSE predictor [54], while SPIDER3 predicts HSEα-up and HSEα-down using real numbers, although Heffernan et al. also report results on HSEβ-up and HSEβ-down [34].
The web server and the standalone of SPIDER3 are described in Secondary Structure, SPIDER3. As a side note, the result page and the confirmation email of the web server show the predicted SA only as ASA in ten classes (i.e., [0-9]), while the predicted ASA, HSEα-up and HSEα-down in real numbers are listed in the output file ("*.spd33") in the temporary directory, along with the PSSM/HMM files (see Figure 5).

Torsional Angles
Protein torsion (or dihedral or rotational) angles can accurately describe the local conformation of protein backbones. The main protein backbone dihedral angles are phi (ϕ), psi (ψ) and omega (ω). The planarity of the peptide bond restricts ω to be close to either 180° (trans, the typical case) or 0° (cis, rare). Therefore, it is generally sufficient to use ϕ and ψ to accurately describe the local shape of a protein.
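For readers who want to compute these angles from coordinates, the following is a minimal sketch of a dihedral angle calculation from four points, using only the standard library. For residue i, ϕ is defined by the atoms C(i-1), N(i), Cα(i), C(i), ψ by N(i), Cα(i), C(i), N(i+1), and ω by Cα(i), C(i), N(i+1), Cα(i+1). The function name and the coordinates in the usage example are illustrative, not taken from a real structure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Return the dihedral angle in degrees, in (-180, 180], for four 3D points.

    The angle is the rotation around the p1-p2 bond between the plane
    (p0, p1, p2) and the plane (p1, p2, p3).
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal to the first plane
    n2 = _cross(b1, b2)  # normal to the second plane
    b1_length = math.sqrt(_dot(b1, b1))
    x = _dot(n1, n2)
    y = b1_length * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Toy coordinates: the fourth point is rotated by a quarter turn.
    print(dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                   (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
```

Using `atan2` rather than `acos` keeps the sign of the angle and avoids numerical problems near 0° and 180°, which matters for ω.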
TA are highly correlated with protein SS and particularly informative in highly variable loop regions. In fact, while TA of α-helices and β-sheets are mostly clustered and regularly distributed [58], ϕ and ψ can be more effective in describing the local conformation of residues when they are classified as coils (i.e., neither of the other SS classes). When four consecutive residues are considered, a different pair of angles can be observed: theta (θ) and tau (τ) [59]. Thus, different annotations (i.e., SS, ϕ/ψ and θ/τ) can be adopted to describe the backbone of a protein.
TA are essentially an alternative representation of local structure with respect to SS.Both TA and SS have been successfully used as restraints toward sequence alignment [60], protein folding [25] and tertiary structure prediction [61].HMM [62], support vector machines (SVM) [58] and several architectures of NN (e.g., iterative [34], [42] and cascade-correlation [63]) have been analysed to predict TA since 2000.NN are currently the main tool to predict TA, in parallel with protein SS [34] or sequentially after it [63], [64].
ϕ and ψ can be predicted as real numbers or letters(/clusters).In fact, ϕ and ψ can range from 0° to 360° but are typically observed in certain ranges, given from chemical and physical characteristics of proteins.Bayesian probabilistic [65], [66], multidimensional scaling (MDS) [67] and density plot [58] approaches have been exploited to define different alphabets of various sizes.
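To make the idea of an angle alphabet concrete, the following sketch assigns a (ϕ, ψ) pair to the nearest of a few cluster centres, taking the periodicity of angles into account. The three-letter alphabet and its centres are hypothetical and chosen only for illustration; they are not the alphabets defined in the cited works, which are derived from data with the approaches listed above.

```python
import math

def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def nearest_letter(phi, psi, centres):
    """Return the letter whose (phi, psi) centre is closest, measuring each
    angle difference on the circle so that 359 and 1 are 2 degrees apart."""
    def distance(letter):
        c_phi, c_psi = centres[letter]
        return math.hypot(angular_difference(phi, c_phi),
                          angular_difference(psi, c_psi))
    return min(centres, key=distance)

# Hypothetical alphabet: the centres below are illustrative assumptions.
CENTRES = {"A": (-60.0, -45.0), "B": (-120.0, 130.0), "L": (60.0, 45.0)}

print(nearest_letter(-65.0, -40.0, CENTRES))
print(nearest_letter(295.0, 315.0, CENTRES))  # same region, 0-360 convention
```

Because differences are taken on the circle, the same assignment is obtained whether the angles are expressed in the -180° to 180° or in the 0° to 360° convention.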

Porter+
Porter+ is a TA predictor able to classify the ϕ and ψ angles of a given protein. It was initially developed in 2006 as an intermediate step to improve Porter (a SS predictor described in section Secondary Structure) [64]. Porter+ adopts an alphabet of 14 letters devised by Sims et al. using MDS on tetrapeptides (4 contiguous residues) [67]. Porter+, similarly to Porter and PaleAle (see Solvent Accessibility, PaleAle), implements BLAST+ to gather evolutionary information and improve the final prediction. Like Porter and PaleAle, the most recent version of Porter+ (v5) also adopts HHblits to greatly improve its accuracy.
The web server of Porter+ is available at http://distilldeep.ucd.ie/porter+. The protein sequence is required, while an email address is optional. It is then sufficient to confirm (by clicking "Predict") to view a confirmation page with the overview of the job. Once ready, the prediction will be received by email. It resembles the format adopted for Porter (see section Secondary Structure). Porter+ can be executed in parallel with Porter or PaleAle, or several more protein predictors, at http://distillf.ucd.ie/distill/ to predict SS, SA, or other protein features, respectively. The light standalone of Porter+ is available at http://distilldeep.ucd.ie/porter+ and closely resembles the one described in section Secondary Structure, Porter. The output of Porter+ reports the confidence for all 14 classes predicted. The datasets adopted for training and testing purposes are also released.

SPIDER3
SPIDER3, also described in sections Secondary Structure and Solvent Accessibility, predicts TA using real numbers (ℝ). SPIDER was initially released in 2014 to predict only θ/τ [59]. It has been further developed to also predict ϕ/ψ, in parallel with SS, SA and contact numbers (see the respective sections) [34], [42]. More details regarding the pipeline implemented, the web server offered and the standalone available are outlined in section Secondary Structure.

Contact Maps
Contact Maps (CM) are the main two-dimensional protein structure annotation. A plain 2D representation of protein tertiary structure would describe the distance between all possible pairs of AA using a matrix containing real values. Such a dense representation, referred to as a distance map, is reduced to a more compact abstraction (i.e., a CM) by quantising the distance map through a fixed threshold, i.e., describing distances not as real numbers but as contacts (distance smaller than the threshold) or non-contacts. This latter abstraction is routinely exploited to reconstruct protein tertiary structures by means of heuristic methods [68], [69]. Because 3D structure prediction is a computationally expensive problem, these heuristic methods aim to be both robust against noise in the CM (i.e., ideally able to fix CM prediction errors) and computationally applicable on a large scale [70], [71]. Following closely the development of the third generation of SS predictors, and motivated by the same abundance of available data and computational resources, MSA have been thoroughly tested and successfully exploited to extract promising features for CM prediction, e.g., correlated mutations, sequence conservation, alignment stability and family size [72]-[74]. These initial advancements led to the first generation of ML methods able to predict CM [24], [75], [76]. However, given that MSA are replete with useful but noisy information, statistical insights have been necessary to further exploit the growing amount of evolutionary information, e.g., distinguishing between indirect and direct coupling [77], [78]. The most recent CM predictors combine recent intuitions in both statistics and advanced ML, aiming to collect, clean and employ as much useful data as possible [22], [33], [50]. Unlike the other protein annotations in this chapter, CM are currently assessed at CASP [79] and CAMEO [6].
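The quantisation of a distance map into a CM can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any specific predictor: the function names are hypothetical, the coordinates are toy values placed on a straight line 3.8 Å apart, and the 8 Å default threshold is only one of the possible choices.

```python
import math

def distance_map(coords):
    """Dense 2D annotation: Euclidean distance between every pair of points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_map(dmap, threshold=8.0):
    """Quantise a distance map: True where the distance is below the threshold."""
    return [[d < threshold for d in row] for row in dmap]

# Toy coordinates in angstroms (illustrative only, not a real protein).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cm = contact_map(distance_map(coords))

for row in cm:
    print("".join("1" if in_contact else "0" for in_contact in row))
```

Note that the resulting matrix is symmetric and that the diagonal is trivially "in contact"; in practice, pairs that are close in sequence are usually treated separately or ignored.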
The intrinsic properties of CM (namely, being compact and discrete two-state annotations, invariant to rotations and translations) make them a more appropriate target for ML techniques than protein tertiary structures or distance maps, while remaining highly informative about the protein 3D structure [80]. CM prediction is a typical intermediate step in many pipelines to predict protein tertiary structure [52], [81], [82]. For example, it is a key component of contact-assisted structure prediction [83], contact-assisted protein folding [23], and free and template-based modelling [81]. CM have also been used to predict protein disorder [84] and protein function [72], and to detect challenging templates [52]. In fact, even partial CM can greatly support robust and accurate protein structure modelling [85].
Being a 2D annotation, CM are typically predicted gradually, starting from simpler but less informative 1D annotations, e.g., SA, SS and TA [75], [76], [86]. The advantages of this incremental approach lie in the intrinsic nature of protein abstractions, i.e., 1D annotations are easier to predict while providing useful insights. For example, Figure 14 highlights the strong relations between SS conformations and CM. The contact occupancy (i.e., contact number, or number of contacts per AA) is another 1D protein annotation which has been successfully predicted [34], [49], [87] to adjust and improve CM prediction [73], [75], [86]. Eigenvector decomposition has been used as a means for template search [88] and principal eigenvector (PE) prediction as an intermediate step towards CM prediction [24]. Finally, correlated mutations appear to be the most informative protein feature for CM prediction, i.e., residues in contact tend to coevolve to maintain the physicochemical equilibrium [72]-[74]. Thus, statistical methods have been extensively assessed to look for coevolving residues, gathering mutual information from MSA while aiming to discriminate direct- from indirect-coupling mutations, e.g., implementing sparse inverse covariance estimation to remove indirect coupling [77], [89], [90]. As in Figure 14, CM are represented as (symmetric) matrices or graphs, rather than vectors, where around 2-5% of all possible pairs of AA are "in contact", i.e., an unbalanced problem in ML [80].
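The notion of correlated mutations can be illustrated with the mutual information (MI) between two columns of an MSA. The sketch below is a deliberately simplified stand-in: plain MI does not separate direct from indirect coupling, which is precisely what the statistical methods cited above are designed to do. The alignment is a toy example and the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment,
    given as a list of equal-length strings. Gaps are treated as symbols."""
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = count * n / (count_i[a] * count_j[b])
        mi += p_ab * math.log2(ratio)
    return mi

# Toy alignment (illustrative only): columns 1 and 3 mutate together.
msa = ["AKLE", "AKLE", "ARLD", "ARLD"]
print(mutual_information(msa, 1, 3))  # covarying columns
print(mutual_information(msa, 0, 1))  # column 0 is fully conserved
```

A fully conserved column carries no covariation signal, which is one reason why the depth and diversity of the MSA matter for CM prediction.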
Notably, the number of AA in contact increases almost linearly with the protein length, i.e., shorter proteins are denser than longer ones [80]. A pair of AA is in contact when the Euclidean distance between their Cβ (or Cα, for glycine) atoms is smaller than a given threshold. This threshold is usually set between 6 and 12 Å (8 Å at CASP [79]), although values in the range of 10-18 Å may lead to better reconstructions [68]. In fact, it is arguable whether all predicted "contacts" should be taken into consideration or whether certain criteria should be applied, such as focusing on those predicted with the highest confidence (i.e., the top 10, L/5, L/2 or L contacts, with L = protein length) or above a minimum probability threshold [79]. For example, tertiary structure modelling benefits more from well-distributed contacts, thus the entropy score is one of the measures of interest to evaluate CM predictors [79]. Precision (i.e., the ratio between true contacts and the sum of true and wrong contacts) is usually adopted to assess local (short-range) contacts (i.e., involving AA within 10 positions of each other) and non-local (long-range) contacts separately. Typically, CM predictors are evaluated at CASP through more complex measures [79], [83], [91], such as z-scores (i.e., weighted sum of energy separation with the true structure for each domain), GDT_TS (i.e., score of optimal superposition between the predicted and the true structure), root mean squared deviation (RMSD) or TM-score (i.e., a measure more sensitive to the global, rather than local, structure than RMSD) [92]. Classic statistical and ML measures, such as the aforementioned precision, recall, F1 score and Matthews Correlation Coefficient (MCC), are also adopted in parallel with more unusual ones, such as alignment depth or entropy score [79]. The average precision of the top predictors at CASP12 was 47% on L/5 long-range contacts for the difficult category, while the highest GDT_TS for each of the 14 domains assessed ranged from 12 to 70 [79].
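The precision on the top-ranked contacts can be computed as in the following sketch. It is an illustrative implementation, not the official CASP evaluation code: the function name, the input layout and the sequence-separation cut-off (`min_separation`) are assumptions to be adapted to the definition of long-range contacts in use, and the data are toy values.

```python
def top_k_precision(predicted, true_contacts, length, k_divisor=5, min_separation=10):
    """Precision of the top L/k_divisor predicted contacts.

    predicted      list of (i, j, confidence) tuples
    true_contacts  set of (i, j) pairs with i < j
    length         protein length L
    min_separation minimum |i - j| for a pair to be considered (assumption)
    """
    candidates = [p for p in predicted if abs(p[0] - p[1]) >= min_separation]
    ranked = sorted(candidates, key=lambda p: p[2], reverse=True)
    top = ranked[:max(1, length // k_divisor)]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(top)

# Toy data (illustrative only): a protein of length 10, top L/5 = 2 contacts.
predicted = [(1, 8, 0.90), (2, 3, 0.95), (1, 5, 0.70), (4, 9, 0.40)]
true_contacts = {(1, 8), (4, 9)}
print(top_k_precision(predicted, true_contacts, length=10, min_separation=3))
```

In the toy example the pair (2, 3) is discarded despite its high confidence because it is too close in sequence, so the two contacts assessed are (1, 8) and (1, 5), of which only the first is correct.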
Although correlated mutations and NN have been identified as promising instruments to also predict CM [75], pairwise contact potential [84], self-organising maps [93] and SVM [76] have been used in the past. While 2D-BRNN [86], [94], multi-stage [24], [95] and template-based [51] NN approaches initially characterized the field [96], the most recent CM predictors rely on multiple 1D protein annotation predictors (e.g., predicting SA and SS along with other protein features), two-stage approaches and coevolution information [50], [97], or multi-class maps [71], [96]. The standard output format of any CM predictor is a text file organised in 5 columns as follows: the positions of the two AA in contact, a blank column, the set threshold (8 Å) and the confidence of each predicted contact.
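A file in this 5-column format can be read with a short parser such as the one sketched below. The sketch assumes that the columns are whitespace-separated and that the blank column is written as a placeholder value (0 in the example), as in the CASP contact format; lines that do not match this layout, such as headers, are skipped. The function name and the sample lines are hypothetical.

```python
def parse_contacts(text):
    """Parse 5-column contact lines: i, j, placeholder, threshold, confidence.
    Returns a list of (i, j, threshold, confidence) tuples."""
    contacts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip headers, sequence lines and empty lines
        try:
            i, j = int(fields[0]), int(fields[1])
            threshold, confidence = float(fields[3]), float(fields[4])
        except ValueError:
            continue  # skip lines that do not contain numeric columns
        contacts.append((i, j, threshold, confidence))
    return contacts

# Hypothetical sample in the format described above (not real predictor output).
sample = """\
1 9 0 8 0.91
2 30 0 8 0.42
END
"""
for i, j, threshold, confidence in parse_contacts(sample):
    print(i, j, threshold, confidence)
```

The parsed tuples can then be sorted by confidence, for instance to select the top L/5 contacts.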

DNCON2
The web server and dataset of DNCON2 are available at http://sysbio.rnet.missouri.edu/dncon2/. A JobID and an email address are required, along with the sequence to predict (up to two sequences at a time). Once the prediction is ready, typically in less than 24h, the predicted CM is sent by email in both text and image format, as email content and attachment, respectively. The email content specifies the number of alignments found and the predicted CM (in the standard 5-column text format).
The standalone of DNCON2 is available at https://github.com/multicom-toolbox/DNCON2/. The same page lists all the instructions to install every requirement, i.e., CCMpred [90], FreeContact [89], HHblits [11], JackHMMER [99] and PSICOV [77] for coevolution information, python libraries (such as Tensorflow), and MetaPSICOV and PSIPRED (see Secondary Structure, PSIPRED) for SS and SA prediction. Once all the requirements are met, it is possible to verify that DNCON2 runs correctly by predicting the 3 sequences provided. The results of each predictor and package involved are organised in directories.

MetaPSICOV
MetaPSICOV is a CM predictor which was initially released in 2014 for CASP11 [100] and updated in 2016 for CASP12 [97]. It is recognised as the first CM predictor able to successfully exploit the recent advancements in co-evolutionary information extraction [101]. In particular, MetaPSICOV achieved this result by implementing three different algorithms to extract the coevolution signal from MSA generated with HHblits [11] and HMMER [37] (i.e., CCMpred [90], FreeContact [89] and PSICOV [77]), along with other local and global features used for SVMcon [76]. It relies on PSIPRED (see Secondary Structure, PSIPRED) to predict SS and on a similar ML method to predict SA. As a final step, MetaPSICOV adopts a two-stage NN to infer CM from the features described [22]. The web server and standalone of MetaPSICOV can also be used to predict hydrogen bonding patterns [22].

RaptorX-Contact
The web server of RaptorX-Contact is available at http://raptorx.uchicago.edu/ContactMap/. Once a protein sequence (in FASTA format) has been inserted, it is possible to submit it and a result URL will be provided (Figure 17). A JobID is recommended to distinguish among past submissions in the "My Jobs" page, while an email address can be specified to receive the outcome of RaptorX-Contact by email, i.e., the result URL and, as attachments, the predicted CM in text and image format. The tertiary structure is also predicted by default, but it is possible to uncheck the respective box to speed up the CM prediction. Up to 50 protein primary structures can be submitted at the same time through the input form or uploaded from one's computer. Optionally, an MSA (of up to 20,000 sequences) can be sent instead of a protein sequence. The result URL links to an interactive page where it is possible to navigate the predicted CM, besides downloading it in text or image format. The MSA generated (in A2M format), the CCMpred [90] output and the 3D models (if requested) are also made available. Finally, it is also possible to query the web server from the command line (using curl) as explained at http://raptorx.uchicago.edu/ContactMap/documentation/.

XX-STOUT
XX-STOUT is a CM predictor initially released in 2006 [24] and further improved to be template-based [51] and multi-class in 2009 [96]. XX-STOUT employs the predictions by BrownAle, PaleAle and Porter (see Secondary Structure and Solvent Accessibility), i.e., contact density, SA and SS predictions, respectively, to generate multi-class CM, i.e., CM with four-state annotations. When either PSI-BLAST [9] or the in-house fold recognition software finds homology information, further inputs are provided to XX-STOUT to perform template-based predictions, i.e., to greatly improve the prediction quality by exploiting proteins in the PDB [1], [52].
The web server of XX-STOUT is available at http://distilldeep.ucd.ie/xxstout/. An email address and the plain protein sequence are required to start the prediction, while a JobID is optional. The confirmation page summarises the information provided and the predictors which are going to be used, i.e., the aforementioned 1D predictors and SCL-Epred, a predictor of subcellular localization [103]. The predicted CM (threshold 8 Å), the per-residue predictions of SS, SA and contact density, and the predicted location of the protein are sent by email. The same email describes the confidence of SCL-Epred's prediction and states whether the whole prediction has been based on PDB templates and, if so, their similarity to the query sequence. The standalone of XX-STOUT and the required 1D predictors are available on request.

Conclusions
In this chapter we have discussed the importance of protein structure to understand protein functions and the need for abstractions (i.e., protein structural annotations) to overcome the difficulties of determining such structures in vitro. We have then presented an overview of the role Bioinformatics (i.e., in silico Biology) has played in advancing such understanding, thanks to one- and two-dimensional abstractions and efficient techniques to predict them that are applicable on a large scale, such as Machine Learning and Deep Learning in particular. The typical pipeline to predict protein structure annotations was also presented, highlighting the key tools adopted and their characteristics.
The chapter then described the main one- and two-dimensional protein structure annotations, from their definition to samples of state-of-the-art methods to predict them. We have given a concise introduction to each protein structure annotation, trying to highlight what is predicted, why and how. We also tried to give a sense of how different abstractions are linked to one another and how this is reflected in the systems that predict them.
A considerable part of this chapter is dedicated to presenting, describing and comparing state-of-the-art predictors of protein structure annotations. The methods presented are typically available as both web servers and standalone programs and, thus, can be used for small or large scale experiments and studies. The general aim of this chapter is to introduce and facilitate the adoption of in silico methods to study proteins by the broader research community.

Figure 1: The homepage of Jpred4. The input sequence is the only requirement while more options are made available.

Figure 2: A typical result page of PSIPRED web server. All the AA are listed and coloured according to the predicted SS class.

Figure 3: The input form of Porter5. Around 200 proteins can be submitted at once in FASTA format.

Figure 4: A partial view of the result page of RaptorX-Property. Each bar in the charts represents the individual confidence.

The latest standalone of RaptorX-Property (v1.01) can be downloaded at http://raptorx.uchicago.edu/download/. Once it has been extracted, it is sufficient to read and follow the instructions in the README to predict SS, SA and disorder regions on one's own machine. As in the web server, sequence profiles can be used or not, and the results are saved in txt and rtf format. The disk space required is considerable: 347 MB at the time of writing, almost fifty times the storage required by Porter5.

Figure 6: A view of SCRATCH Protein Predictor.

Figure 7: A view of SCRATCH Protein Predictor where both ACCpro predictors have been selected.

Figure 8: A view of PaleAle5 where the reset button and the links are highlighted.

The light standalone of PaleAle is available at the same address and requires only python3 and HHblits to perform SA predictions. As in Porter, PSI-BLAST can be optionally employed to gather further evolutionary information. The output file presents the confidence for each of the four states predicted. The datasets are released at the same address.

Figure 9: The view on the predicted three-states SA performed by RaptorX-Property.

Figure 10: A view of the input window of SPIDER3. The steps to follow to start a prediction are highlighted.

Figure 12: A view of Porter+5 where the steps to start a prediction are highlighted.

Figure 13: A view of the results page of SPIDER3 where the steps to view the predicted TA are highlighted.

Figure 15: The pipeline of DNCON2 is summarised in the confirmation page.

Figure 16: A typical result page of MetaPSICOV. All the files, except the png, follow PSICOV's format.

The web server of the 2014 version of MetaPSICOV is available at http://bioinf.cs.ucl.ac.uk/MetaPSICOV. A simple interface, which resembles the web server of PSIPRED (see Secondary Structure, PSIPRED), asks for a single sequence in FASTA format and a short identifier. A confirmation page is automatically shown when the job is completed. If an email address is inserted, an email containing only the permalink to the result page will be sent. As in Figure 16, the result page contains links to the output of MetaPSICOV stage 1 (also as image), of stage 2, of MetaPSICOV-hb (hydrogen bonds) and of PSICOV. A typical CM takes between 20 minutes and 6 hours to be predicted.

Figure 17: The confirmation page of RaptorX-Contact shows the pending jobs ahead and the result URL.

Figure 18: XX-STOUT sends the predicted protein structure annotations in the email body, except the CM (which is attached).