Predicting Protein Interaction Sites Based on a New Integrated Radial Basis Functional Neural Network

Interactions among proteins are the basis of various life events, so it is important to identify and study protein interaction sites. A control set containing 149 protein molecules was used here. Ten features were extracted, and 4 sample sets, each built with a sliding window of 9 residues, were constructed from these features. Each of the 4 sample sets was processed by a Radial Basis Functional (RBF) neural network optimized by Particle Swarm Optimization (PSO), which gave 4 groups of results. Finally, these 4 groups of results were integrated by Genetic Algorithm based Selected Ensemble (GASEN), and better accuracy was obtained. The integrated method was thus shown to be effective.


Introduction
Proteins are polymers made up of amino acids. Each has a unique three-dimensional structure and carries out specific functions. However, proteins do not function alone: they complete a particular function through their interactions. Protein interactions control the various processes of life [1].
To understand the principles of protein interactions, we must first determine which parts of a protein participate in them. This leads to the concept of interaction sites. An interaction site is an amino acid residue in a chain: if a residue is involved in an interaction, it is defined as an interaction site; otherwise, it is defined as a non-interaction site [2]. Predicting protein interaction sites with bioinformatics methods is therefore a two-class classification problem. Much work has been done on it: Mile Sikic et al. [3] listed 17 different features and used random forests for prediction. Man Lan et al. [4] adopted SVM for prediction, but focused mainly on feature generation and representation.
In this paper, we extracted 10 important features and used a Radial Basis Functional (RBF) [5] neural network as the classifier. Then, Genetic Algorithm based Selected Ensemble (GASEN) [6] was used to integrate the results generated by the single classifiers, and better accuracy was obtained.

Materials and Methods
Data Set. We selected a non-redundant control data set containing 149 protein molecules: 92 hetero-complexes and 57 homo-complexes. The data set is available on the SPPIDER web site (http://spider.cchmc.org) [7].
Features. Features are the first key to successful prediction. We extracted 10 different features:
1) sequence profiles (SP) [8,9]: the relative frequency of an amino acid type at each position, generated by multiple sequence alignment.
2) entropy (E) [9,10]: a measure of sequence variability at a position. Here, it expresses the degree of order among the elements (amino acid residues).
3) relative entropy (RE) [9]: the normalized entropy, ranging from 0 to 100.
4) conservation weight (CW) [9]: a measure of sequence conservation at a position, ranging from 0 to 1.
5) complex accessible surface area (CASA) [2,9]: the total solvent exposure in a bound complex.
6) sequence variability (SV) [9]: on a scale of 0-100, derived from the NALIGN alignments.
7) back-bone ASA (b-ASA) [3,11]: calculated by PSAIA.
8) side-chain ASA (s-ASA) [3,11].
9) polar ASA (p-ASA) [3,11].
10) non-polar ASA (n-ASA) [3,11].
Definition of Protein Interaction Sites. Usually, two methods can be used to define an interaction site [8]; we chose the first. Before the protein complex forms, a residue exists in a monomer, and we call its accessible surface area (ASA) the MASA. After the complex has formed, we call its ASA the CASA. A residue in the complex was defined as a surface residue if MASA / (total ASA of the free amino acid) ≥ 20% [12]. The total ASA values of the 20 amino acids were calculated by Huanxiang Zhou [13]. Finally, a surface residue was defined as an interaction site if MASA - CASA ≥ 1 Å² [2]. The others were non-interaction sites. Here, the 4 sample sets were all built with a sliding window of 9 residues, so they contained 20 * 9, 23 * 9, 25 * 9 and 29 * 9 values per residue, respectively.
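The labelling rule and the window construction described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the window is assumed to be centred on the residue, chain ends are assumed to be zero-padded, and residues below the 20% threshold are assumed to be excluded from the sample set. None of these details is specified in the text.

```python
from typing import List, Optional, Sequence

SURFACE_RATIO = 0.20    # threshold on MASA / total ASA of the free amino acid
INTERFACE_DELTA = 1.0   # threshold on MASA - CASA, in square angstroms


def label_residue(masa: float, casa: float, total_asa: float) -> Optional[int]:
    """Return 1 for an interaction site, 0 for a non-interaction site,
    and None for a residue that is not on the surface (assumed excluded)."""
    if masa / total_asa < SURFACE_RATIO:
        return None
    return 1 if masa - casa >= INTERFACE_DELTA else 0


def sliding_windows(features: Sequence[Sequence[float]], width: int = 9) -> List[List[float]]:
    """Concatenate the feature vectors of `width` consecutive residues,
    centred on each residue; positions beyond the chain ends are zero-padded
    (the padding scheme is an assumption of this sketch)."""
    half = width // 2
    pad = [0.0] * len(features[0])
    windows = []
    for centre in range(len(features)):
        row: List[float] = []
        for pos in range(centre - half, centre + half + 1):
            row.extend(features[pos] if 0 <= pos < len(features) else pad)
        windows.append(row)
    return windows
```

With 20 values per residue and a window of 9, each row holds 20 * 9 = 180 values, matching the size of the first sample set.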

RBF Neural Network and Integration.
The classifier is the second key to this problem. We adopted an RBF neural network as the classifier. It contains only 3 layers. The RBF that we adopted was:

$$H_i(x) = \exp\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right) \qquad (1)$$

where $H_i(x)$ are the results of the hidden layer, $x$ are the input vectors, $c_i$ are the centers of the function and $\sigma_i$ are its widths. The output layer computes:

$$f_j = \sum_{i=1}^{J} \omega_{ij} H_i(x) \qquad (2)$$

where $f_j$ are the final results of the entire neural network, $\omega_{ij}$ are the weights that connect the second layer and the third layer, and $J$ is the number of nodes in the second (hidden) layer. We used PSO [14] to optimize the parameters. In our RBF neural networks, the centers and widths of the function needed to be optimized, as did the weights linking the second layer and the third layer. All of these parameters were therefore included in each particle of PSO.
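As an illustration of how one PSO particle can carry all of the network parameters, the sketch below decodes a flat particle into centers, widths and output weights and evaluates the network. It assumes a Gaussian basis function with center c_i and width sigma_i, a particular ordering of the parameters inside the particle, and a mean-squared-error fitness. The function names, the layout and the fitness are assumptions of this sketch, not details given in the text.

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[List[List[float]], List[float], List[List[float]]]


def decode_particle(particle: Sequence[float], n_in: int, n_hidden: int, n_out: int) -> Params:
    """Split a flat particle into centers, widths and output weights.
    Assumed layout: all centers, then all widths, then all weights."""
    expected = n_hidden * n_in + n_hidden + n_hidden * n_out
    if len(particle) != expected:
        raise ValueError(f"particle must hold {expected} values")
    p = list(particle)
    centres = [p[i * n_in:(i + 1) * n_in] for i in range(n_hidden)]
    offset = n_hidden * n_in
    widths = p[offset:offset + n_hidden]
    offset += n_hidden
    weights = [p[offset + i * n_out:offset + (i + 1) * n_out] for i in range(n_hidden)]
    return centres, widths, weights


def rbf_forward(x: Sequence[float], centres, widths, weights) -> List[float]:
    """Hidden layer: Gaussian of the distance to each center (widths must be non-zero).
    Output layer: weighted sum of the hidden outputs."""
    hidden = []
    for c, s in zip(centres, widths):
        dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        hidden.append(math.exp(-dist2 / (2.0 * s * s)))
    n_out = len(weights[0])
    return [sum(hidden[i] * weights[i][j] for i in range(len(hidden))) for j in range(n_out)]


def fitness(particle, samples, n_in: int, n_hidden: int, n_out: int) -> float:
    """Mean squared error over (input, target) pairs; lower is better.
    Using MSE as the PSO fitness is an assumption of this sketch."""
    centres, widths, weights = decode_particle(particle, n_in, n_hidden, n_out)
    total = 0.0
    for x, target in samples:
        out = rbf_forward(x, centres, widths, weights)
        total += sum((o - t) ** 2 for o, t in zip(out, target))
    return total / len(samples)
```

A PSO loop would then move each particle through this parameter space and keep the positions with the lowest fitness.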
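Finally, GASEN was used to integrate the results of the RBF neural networks. It is an extension of the generalized ensemble method (GEM). GEM is calculated by the following formula:

$$f_{GEM}(x) = \sum_{i=1}^{n} \omega_i f_i(x) \qquad (3)$$

where $f_i(x)$ is the result of the $i$-th neural network and $n$ is the number of networks.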

Environmental Biotechnology and Materials Engineering
Here, $\omega_i$ is the weight of the $i$-th neural network; it ranges between 0 and 1, and the $n$ weights $\omega_i$ sum to 1. In GASEN, the weights were optimized by a Genetic Algorithm, which optimized the 4 weights simultaneously in each cycle.
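The weighted combination and the evolutionary search over the weights can be sketched as follows. This is a generic, simplified genetic algorithm (truncation selection, averaging crossover, Gaussian mutation) written for illustration only. It is not the GASEN reference procedure, and the function names, the operators, the population settings and the mean-squared-error objective are all assumptions.

```python
import random
from typing import List, Sequence


def normalise(weights: Sequence[float]) -> List[float]:
    """Clip negative values to 0 and rescale so that the weights sum to 1."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(clipped)] * len(clipped)
    return [w / total for w in clipped]


def gem_combine(outputs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the member outputs for one sample."""
    return sum(w * f for w, f in zip(weights, outputs))


def ensemble_error(weights, member_outputs, targets) -> float:
    """Mean squared error of the weighted ensemble; member_outputs[k][i] is
    the output of network i on sample k."""
    errors = [(gem_combine(row, weights) - t) ** 2 for row, t in zip(member_outputs, targets)]
    return sum(errors) / len(errors)


def evolve_weights(member_outputs, targets, pop_size: int = 30,
                   generations: int = 50, sigma: float = 0.1, seed: int = 0) -> List[float]:
    """Search for ensemble weights with a small genetic algorithm."""
    rng = random.Random(seed)
    n = len(member_outputs[0])
    # Start from the uniform weighting plus random candidates.
    population = [normalise([1.0] * n)]
    population += [normalise([rng.random() for _ in range(n)]) for _ in range(pop_size - 1)]

    def score(w):
        return ensemble_error(w, member_outputs, targets)

    for _ in range(generations):
        population.sort(key=score)
        parents = population[:pop_size // 2]  # the better half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2.0 + rng.gauss(0.0, sigma) for x, y in zip(a, b)]
            children.append(normalise(child))
        population = parents + children
    return min(population, key=score)
```

Because the better half of the population is always kept and the uniform weighting is in the initial population, the returned weights are never worse than simple averaging on the data used for the search.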

Experiments and Results
Evaluation of the Results. First, we defined the following quantities: TP: the number of interaction sites that were predicted correctly; TN: the number of non-interaction sites that were predicted correctly; FP: the number of non-interaction sites that were predicted as interaction sites; FN: the number of interaction sites that were predicted as non-interaction sites; N: the number of all sites in a protein molecule.
The following metrics were used to evaluate the prediction results [2,7], including the sensitivity of the positive data, the accuracy and the Matthews correlation coefficient (MCC):

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Accuracy} = \frac{TP + TN}{N}$$

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$

Experimental Results. We adopted 10-fold cross-validation. The 149 protein molecules were divided into 10 groups (the last group contained 14 molecules). Each time, one group was selected as the test set and the remaining groups formed the training set. Thus every sample set was run 10 times (see Fig. 1 and Table 1). On the SPPIDER web site, the reported final accuracies were 72.48% and 74.18%. Our last 4 results were all better than theirs.
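The evaluation protocol above can be sketched as follows: a split of the 149 molecules into 10 groups, and metrics computed from the TP, TN, FP and FN counts. The function names are hypothetical. Assigning molecules to groups in consecutive order is an assumption, and accuracy and MCC are written in their standard forms, since the text names only the sensitivity explicitly.

```python
import math
from typing import Dict, List, Sequence


def split_folds(items: Sequence, k: int = 10) -> List[list]:
    """Split items into k consecutive groups of (almost) equal size.
    For 149 items and k = 10 this gives nine groups of 15 and a last group of 14."""
    size = math.ceil(len(items) / k)
    return [list(items[i * size:(i + 1) * size]) for i in range(k)]


def train_test_pairs(folds: List[list]):
    """Yield (train, test) pairs, using each group once as the test set."""
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


def metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, accuracy and Matthews correlation coefficient.
    Undefined ratios (zero denominators) are reported as 0.0 by convention here."""
    n = tp + tn + fp + fn
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / n if n else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

In a full experiment, the counts would be accumulated over the residues of each test group and the metrics averaged over the 10 runs.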
In the end, we used 1qz8 to validate our method. It is a fragment of SARS coronavirus NSP9. There are two chains (A, B) in this molecule, and we used the first. The chain contains 111 amino acid residues, of which 96 were predicted correctly by GASEN (see Fig. 2). In the figure, blue shows the interaction sites that were predicted correctly, red shows the non-interaction sites that were predicted correctly, and yellow shows the 15 residues that were not predicted correctly.

Conclusion
In this paper, we extracted 10 features, some of which were new. The 4 sample sets that we created were also new. Finally, we demonstrated that our new ideas were effective. We hope that more new methods from computational intelligence and biology will be applied to this problem in the future.
