PeptiDesCalculator: Software for computation of peptide descriptors. Definition, implementation and case studies for 9 bioactivity endpoints

We present a novel Java‐based program named PeptiDesCalculator for computing peptide descriptors. These descriptors include redefinitions of known protein parameters to suit the peptide domain, generalization schemes for the global description of peptide characteristics, and empirical descriptors based on experimental evidence on peptide stability and interaction propensity. The PeptiDesCalculator software provides a user‐friendly Graphical User Interface (GUI) and is parallelized to maximize the use of the computational resources available in current workstations. The PeptiDesCalculator indices are employed in modeling 8 peptide bioactivity endpoints, demonstrating satisfactory behavior. Moreover, we compare the performance of a support vector machine (SVM) classifier built using 15 PeptiDesCalculator indices with that of a recently reported deep neural network (DNN) antimicrobial activity classifier, demonstrating comparable test set performance notwithstanding the remarkably fewer degrees of freedom of the former. This software will facilitate the development of in silico models for the prediction of peptide properties.

KEYWORDS
antimicrobial, machine learning, peptide, PeptiDesCalculator

| INTRODUCTION
Over the past two decades, peptide drug discovery (PDD) has experienced renewed interest and momentum, thanks to a greater appreciation of the potential of peptides to address unmet clinical needs and/or to serve as better alternatives to small molecule therapeutics. Concurrently, the remarkable advancement of recombinant biologics in recent years has rendered the high-throughput synthesis of macromolecules a routine and cost-effective process, further contributing to the renaissance of PDD. 1 Peptides, defined as macromolecules composed of 2-50 amino acids, will probably attract increasing interest in the coming decades.
Their advantages include high specificity and activity and easy degradation; in addition, they do not yield toxic metabolites and may be reutilized by the organism instead of being converted into waste products. 1,2 This implies that they generally possess reduced toxicity and few side effects. Indeed, the number of commercially available therapeutic peptides has progressively increased in recent decades (about 68 are currently approved in the EU), 3 covering multiple clinical applications such as antineoplastics, antivirals, antifungals, antibiotics, and modulators of the immune, cardiovascular, and nervous systems, in addition to their utility in diagnosis.
Notwithstanding the benefits of peptide-based therapy, the translation of promising peptides into clinical therapeutics continues to be a challenge due to their inherent bio- and physicochemical properties: they are water-soluble and hence generally exhibit limited capacity to diffuse across biomembranes such as the gastrointestinal epithelium, and they are biologically unstable, as they are rapidly metabolized by human proteolytic enzymes, which yields short plasma half-lives. Consequently, peptides are generally administered through injections, often several times a day, to the detriment of patient compliance and convenience. 3 The ultimate and long sought-after goal is to achieve orally administrable therapeutic peptides. Nonetheless, this will require PDD paradigms that integrate comprehensive analyses of the bioactivity, pharmacodynamic, pharmacokinetic, and toxicological profiles of peptides in the different phases of PDD. Such workflows will allow for the design of peptides that not only have favorable therapeutic efficacy but also offer adequate bioavailability and ease of administration.
On the path towards this goal, computational tools customized for predictive peptide modeling will be crucial, particularly for analyzing the existing experimental evidence to offer inferences on possible peptide bioactivity profiles. The utility of in silico tools in accelerating and optimizing drug discovery has long been recognized. 4 Moreover, the recent advances in machine learning algorithms and computing technology offer an opportunity to incorporate state-of-the-art computational techniques into PDD workflows.
As may be anticipated, successful in silico predictive modeling requires adequate characterization of the compositional, chemical and physicochemical attributes of peptide molecules. However, from our extensive review of the literature we noted that, while there is software for calculating descriptors for small molecules and proteins, there is no equivalent software specifically customized for descriptor calculation for peptides, which are macromolecules at the interface of small organic molecules and proteins. Usually, research groups build in-house scripts to compute descriptors from peptide sequences and amino acid properties, or utilize the "Peptides" package of the R programming language, which provides 10 structural characteristics for antimicrobial peptides. [5][6][7][8][9] Recently, there have been attempts to employ small molecule descriptor programs (eg, Dragon, PaDEL, CoMFA) to build peptide bioactivity models, but these have been limited to short peptides, that is, fewer than 10 amino acids and mainly di-, tri-, and tetrapeptides, probably due to the prohibitive computational cost of applying small molecule software. 10,11 Considering that in the last decade the average length of peptides entering clinical development was 20 amino acids, 3 it is clear that the chemical space covered by these models is narrow.
Additionally, in a recent study, an effort to consider diverse lengths yielded rather modest correlations, that is, R² < 0.56, 12 below the recommended limit of acceptability. 13 There is clearly a need for user-friendly descriptor computing software customized for peptides.
On the other hand, while it is plausible that protein descriptors may be adopted as alternatives, these seem not to have gained traction in the modeling of peptide bioactivity endpoints, probably because some protein descriptors may be redundant (eg, the popular sequence autocorrelation indices, defined to consider up to 30 lag values, would be redundant for short peptides). Moreover, important protein descriptors, such as the solvent accessible surface area, would not make much sense for short peptide sequences.
We present herein a user-friendly and cross-platform Java-based software named PeptiDesCalculator for computing descriptors for peptide molecules. The following contributions may be highlighted: (a) we have collected and reimplemented existing sequence-based protein descriptors, normalized and/or truncated to suit the peptide domain, (b) applied aggregation operators that generalize the traditional approach of the summation of the amino acid contributions to obtain global peptide descriptions, [14][15][16][17][18][19]

| Molecular descriptors for peptides
The following descriptors have been implemented in the PeptiDesCalculator software:
1. Compositional descriptors, which include the amino acid, dipeptide and tripeptide sequence composition.
2. Composition, transition and distribution descriptors, as proposed by Dubchak et al. 22 These descriptors characterize the global composition of given amino acid properties, the frequency with which these properties vary along the peptide sequences, and the corresponding property distribution patterns. 22 Taking hydrophobicity as an example, the amino acids may be classified as hydrophobic, neutral, or polar. For a given peptide sequence, the composition descriptors are defined as the percentage of amino acids in each class. The transition descriptors, in turn, are defined as the percentage frequency with which an amino acid in one class is followed by one from a different class, that is, hydrophobic followed by neutral (or neutral followed by hydrophobic), polar followed by hydrophobic (or hydrophobic followed by polar) and neutral followed by polar (or polar followed by neutral). Finally, the distribution descriptors are the positions, expressed as percentages of the sequence length, within which the first amino acid and 25%, 50%, 75%, and 100% of the amino acids with a given property are included.
3. Conjoint triad descriptors, as proposed by Shen et al. 23 These descriptors are defined following three main steps. Firstly, the 20 standard amino acids are clustered into seven classes based on the dipoles and volumes of their side chains (Table 1).
Next, the frequency of amino acid triads (ie, units of three contiguous amino acids) is determined, with the particularity that units with amino acids belonging to the same classes (Table 1) are considered equivalent, since they are deemed to play a similar role. Bearing in mind that the amino acids are stratified into seven clusters, the total number of triads is 343 (ie, 7 × 7 × 7). For a given peptide sequence, the frequency of each triad is determined, yielding a vector $F(f_i)$, where $f_i$ is the frequency of triad $t_i$.
Finally, the conjoint triad descriptor is a vector $D(t_i)$, defined as:

$$D(t_i) = \frac{f_i - \min\{f_1, f_2, \ldots, f_{343}\}}{\max\{f_1, f_2, \ldots, f_{343}\}}$$

where min and max refer to the minimum and maximum frequencies in the vector $F(f_i)$.
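The three steps of the conjoint triad scheme can be illustrated with a minimal Python sketch. It is not the PeptiDesCalculator implementation: the seven-class grouping is the one commonly attributed to Shen et al. and is assumed here because Table 1 is not reproduced in this text, the normalization (f_i − min)/max is likewise taken from that scheme, and the names `CLASSES` and `conjoint_triad` are hypothetical.

```python
from itertools import product

# Assumption: seven side-chain classes as commonly attributed to Shen et al.
# (Table 1 is not available in this text).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, members in enumerate(CLASSES) for aa in members}

# All 7 x 7 x 7 = 343 triads of class indices, in a fixed order.
TRIADS = list(product(range(len(CLASSES)), repeat=3))


def conjoint_triad(sequence):
    """Return the 343-element normalized conjoint triad vector of a sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    classes = [CLASS_OF[aa] for aa in sequence.upper()]
    counts = dict.fromkeys(TRIADS, 0)
    # Slide a window of three contiguous residues along the sequence.
    for i in range(len(classes) - 2):
        counts[tuple(classes[i:i + 3])] += 1
    freqs = [counts[t] for t in TRIADS]
    f_min, f_max = min(freqs), max(freqs)
    if f_max == 0:  # sequences shorter than three residues contain no triad
        return [0.0] * len(TRIADS)
    return [(f - f_min) / f_max for f in freqs]
```

For example, `conjoint_triad("AGVAG")` contains three triads that all map to the same class triple, so a single element of the vector equals 1.0 and the remaining 342 are 0.0.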
where N refers to the number of amino acids in a sequence. The amino acid properties considered in the present study were compiled from the AAindex database and the literature. 20,21 A total of 520 properties, comprising physicochemical, biochemical and topological amino acid properties, were retrieved. Given this high number of properties and their possible correlation, dimensionality reduction was deemed necessary. To this end, k-means cluster analysis (k-CA) was employed.
The k-CA algorithm aims to partition a set of objects (features or instances) into k clusters such that similar objects, as determined by a given similarity score, are assigned to the same cluster. From an optimization perspective, k-CA may be understood as a min-max problem, in which the intra-cluster variance is minimized while the inter-cluster variance is maximized. The partitioning of objects into k clusters allows for the selection of representative members from each cluster, thus serving as a dimensionality reduction tool. For the k-CA performed herein, the squared Euclidean distance was employed as the similarity measure and the number of clusters (k)

$$\mathrm{QSO}_a = \frac{f_a}{\sum_{a'=1}^{20} f_{a'} + w \sum_{l} r_l}, \qquad \mathrm{QSO}_{20+l} = \frac{w\, r_l}{\sum_{a'=1}^{20} f_{a'} + w \sum_{l} r_l}$$

where $f_a$ is the frequency of amino acid $a$, $w$ is an empirical weighting factor set to 0.75, $r_l = \sum_{i=1}^{N-l} (d_{i,i+l})^2$, also known as the sequence order coupling number, and $d_{i,i+l}$ is the physicochemical distance between the amino acids at positions $i$ and $i + l$, as defined by Schneider and Wrede. 26 The physicochemical distance metric is defined as the Euclidean distance between vectors comprising four physicochemical properties of the amino acids, that is, hydrophobicity, hydrophilicity, polarity, and side-chain volume.
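The sequence order coupling number and the quasi-sequence-order terms described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the PeptiDesCalculator code: `properties` is a hypothetical user-supplied table of four-component property vectors (the Schneider and Wrede values are not reproduced here), the frequencies are taken as normalized occurrence frequencies, and `max_lag` is an assumed parameter for the largest lag considered.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def coupling_number(sequence, lag, properties):
    """Sum of squared physicochemical distances between residues `lag` apart."""
    total = 0.0
    for i in range(len(sequence) - lag):
        a, b = properties[sequence[i]], properties[sequence[i + lag]]
        total += math.dist(a, b) ** 2  # squared Euclidean distance
    return total


def quasi_sequence_order(sequence, max_lag, properties, w=0.75):
    """Return the 20 composition terms followed by `max_lag` coupling terms."""
    # Assumption: f_a is the normalized occurrence frequency of amino acid a.
    freqs = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    couplings = [coupling_number(sequence, lag, properties)
                 for lag in range(1, max_lag + 1)]
    denominator = sum(freqs) + w * sum(couplings)
    return ([f / denominator for f in freqs]
            + [w * r / denominator for r in couplings])
```

By construction the terms share a common denominator and therefore sum to 1. With the purely illustrative table `{"A": (0, 0, 0, 0), "C": (1, 0, 0, 0)}`, the sequence `"ACA"` and `max_lag=2`, the two coupling numbers are 2 and 0, so the first sequence-order term is 0.75 × 2 / 2.5 = 0.6.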
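The pseudo-amino acid composition vector $V_{\mathrm{PseAAC}}$ comprises the $\mathrm{PseAAC}_a$ and $\mathrm{PseAAC}_{20+l}$ terms, which are defined as follows:

$$\mathrm{PseAAC}_a = \frac{f_a}{\sum_{a'=1}^{20} f_{a'} + w \sum_{l} r_l}, \qquad \mathrm{PseAAC}_{20+l} = \frac{w\, r_l}{\sum_{a'=1}^{20} f_{a'} + w \sum_{l} r_l}$$

where $f_a$ is the frequency of amino acid $a$, $r_l = \frac{1}{N-l}\sum_{i=1}^{N-l} \Theta(A_i, A_{i+l})$, denominated as the sequence order correlation factor, $\Theta(A_i, A_{i+l})$ is the correlation between amino acid properties and $w$ is an empirical weighting factor set to 2.5. The correlation factor describes the similarity between amino acids based on the average squared Euclidean distance between normalized hydrophobicity, hydrophilicity and side-chain mass values, as expressed by Equation (6) 25 :

$$\Theta(A_i, A_j) = \frac{1}{3}\left\{[H_1(A_i) - H_1(A_j)]^2 + [H_2(A_i) - H_2(A_j)]^2 + [M(A_i) - M(A_j)]^2\right\} \qquad (6)$$

where $H_1(A_i)$, $H_2(A_i)$ and $M(A_i)$ are the normalized hydrophobicity, hydrophilicity and side-chain mass of the amino acid $A_i$, obtained as follows:

The amphiphilic pseudo-amino acid composition vector $V_{\mathrm{APseAAC}}$ comprises the $\mathrm{APseAAC}_a$ and $\mathrm{APseAAC}_{20+l}$ terms, which are defined as follows:

$$\mathrm{APseAAC}_a = \frac{f_a}{\sum_{a'=1}^{20} f_{a'} + w \sum_{l} r_l}, \qquad \mathrm{APseAAC}_{20+l} = \frac{w\, r_l}{\sum_{a'=1}^{20} f_{a'} + w \sum_{l} r_l}$$

where $f_a$ is the frequency of amino acid $a$.

where $l$ is the autocorrelation lag, $P_i$ and $P_{i+l}$ are the properties of the amino acids at positions $i$ and $i + l$, and $\bar{P}$ is the average value of property $P$, $\bar{P} = \sum_{i=1}^{N} P_i / N$. As is evident in Equations (9), (10) and (11) Table 2.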
Note that the Geary, Moran and normalized Moreau-Broto autocorrelation descriptors may in turn be employed as generalization schemes for other descriptor formalisms (eg, sequence order coupling derived descriptors) which yield vectors of amino acid/pair-wise contributions.
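As an illustration of how these three autocorrelation schemes operate on a vector of per-residue values, the following Python sketch uses the definitions commonly given for sequence autocorrelation descriptors. It is an assumption that these coincide with Equations (9) to (11), which are not legible in this text, and the function names are hypothetical. Each function expects a lag smaller than the vector length, and the Moran and Geary indices require non-constant values.

```python
def _mean(values):
    return sum(values) / len(values)


def moreau_broto(values, lag):
    """Normalized Moreau-Broto autocorrelation: mean product at the given lag."""
    n = len(values)
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


def moran(values, lag):
    """Moran autocorrelation: lagged covariance divided by the variance."""
    n = len(values)
    mean = _mean(values)
    covariance = sum((values[i] - mean) * (values[i + lag] - mean)
                     for i in range(n - lag)) / (n - lag)
    variance = sum((v - mean) ** 2 for v in values) / n
    return covariance / variance


def geary(values, lag):
    """Geary autocorrelation: half the mean squared lagged difference,
    divided by the sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = _mean(values)
    numerator = sum((values[i] - values[i + lag]) ** 2
                    for i in range(n - lag)) / (2 * (n - lag))
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    return numerator / denominator
```

The input `values` may be any per-position vector, for instance an amino acid property mapped along the sequence or the pair-wise contributions produced by another descriptor formalism, which is the generalization referred to above.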

| Design and implementation
PeptiDesCalculator is a standalone program developed in the Java programming language (version 1.8) and can thus be run on any operating system that has the Java Virtual Machine (JVM) installed. PeptiDesCalculator integrates both front-end and back-end layers.
The former contains the Graphic User Interface (GUI), which allows

| Front end: PeptiDesCalculator graphic user interface
The GUI was designed to allow for a simple and user-friendly configuration of the peptide molecular descriptors (MDs) computation. Figure 1

| Back end: Infrastructure for peptide descriptor computation
The tasks (descriptor calculation) determined by the client through the
Table 3 illustrates the total and average computation time, as well as the speedup and efficiency metrics, for the two descriptor groups.
It is evident from Table 3 that the total processing time generally decreases with an increase in the number of processors, and it can therefore be inferred that the parallel computing architecture was adequately implemented.
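The two parallel performance metrics reported in Table 3 follow directly from their definitions: speedup is the ratio of the one-processor time to the time on n processors, and efficiency is the speedup divided by n. A minimal Python sketch is given below; the function names and the timings in the usage note are hypothetical and are not taken from Table 3.

```python
def speedup(sequential_time, parallel_time):
    """Ratio of the one-processor time to the time on n processors."""
    return sequential_time / parallel_time


def efficiency(sequential_time, parallel_time, n_processors):
    """Speedup divided by the number of processors used."""
    return speedup(sequential_time, parallel_time) / n_processors
```

For instance, a task that takes 120 s on one processor and 40 s on four processors has a speedup of 3.0 and an efficiency of 0.75; an efficiency of 1.0 would correspond to ideal linear scaling.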

| Evaluation of Predictive Capacity of PeptiDesCalculator Indices
In order to assess the utility of the PeptiDesCalculator indices in the modeling of peptide bioactivity profiles, we selected eight endpoints, that is, Hepatitis C inhibitory activity (407 peptides
a Speedup: ratio of the processing time for a baseline sequential workflow (ie, using one processor) to the time taken with a parallelized framework to execute the same task on n processors (n > 1).
b Efficiency: ratio of speedup to the number of processors.
was developed. Bearing in mind that the reported metric for the different inhibitory profiles was the half-maximal inhibitory concentration (IC50), 34 values >20 μM were labeled as inactive). For the rest of the endpoints, a threshold value of 10 μM was employed.
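Converting IC50 values into class labels with a concentration threshold can be sketched as below. This is only an illustration: the text does not state whether values exactly at the cutoff count as active, so the sketch assumes that values at or below the threshold are active, and the function name `label_activity` is hypothetical.

```python
def label_activity(ic50_um, threshold_um=10.0):
    """Label a peptide from its IC50 (in micromolar) using a single cutoff.

    Assumption: values at or below the cutoff are treated as active.
    """
    return "active" if ic50_um <= threshold_um else "inactive"
```

Calling `label_activity(25.0)` yields `"inactive"` under the default 10 μM cutoff, whereas `label_activity(15.0, threshold_um=20.0)` yields `"active"` when a 20 μM cutoff is used instead.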

| Comparison with other approaches in the literature
Herein, we sought to evaluate the predictivity of the
with more than 50 amino acids; their identity is provided in the input-Errors.log file. Following the same procedure discussed in the previous section and illustrated in Figure 3, antimicrobial activity classification models were built using the retrieved dataset. To approximate the dataset size employed in the reference study, the test dataset size was set to 33% of the entire dataset.
FIGURE 3 General workflow followed in the modeling of the nine peptide bioactivity endpoints in the present report

| CONCLUSION
The PeptiDesCalculator software provides a user-friendly platform for computing theoretical descriptors for peptide molecules. In light of the satisfactory performance of the models built with the PeptiDesCalculator indices, it may be inferred that these codify relevant peptide structural, chemical, and physicochemical information, useful in the prediction of peptide bioactivity profiles. It is hoped that this computational program will facilitate the development of in silico models for the prediction of peptide bioactivity, pharmacokinetic, and toxicological profiles and consequently guide the discovery, design, and optimization of therapeutically interesting peptides. The PeptiDesCalculator software is available for academic use upon request at info@protoqsar.com.