Method FCMSPimpute for Estimation of Missing Values in Microarray Data

Gene expression microarray experiments produce datasets with numerous missing expression values, which can significantly affect the performance of statistical and machine learning algorithms. In this paper, we propose a novel method to estimate missing values in microarray gene expression data, based on fuzzy clustering and the shortest path (SP) algorithm for measuring semantic similarity on the Gene Ontology (GO). In the proposed method, missing values are imputed with values generated from cluster centers. Gene similarity in the clustering process is determined from both the GO structure information and the terms' properties. We have applied the proposed method to two datasets with different percentages of missing values. The experimental results indicate that the proposed method estimates missing values more accurately, because the semantic similarity obtained by the SP algorithm correlates better with expression similarity than that obtained by node-based methods.


INTRODUCTION
Gene expression microarrays are a popular technique for monitoring the relative expression of thousands of genes under a variety of experimental conditions [1]. Microarray experiments can generate datasets with multiple missing expression values for various reasons, e.g. insufficient resolution, image corruption, dust or scratches on slides, or experimental error during the laboratory process. A dataset is an m × n gene expression matrix with m genes and n experiments. Unfortunately, many algorithms for gene expression analysis need a complete matrix of gene array values as input [2], [3]. Therefore, these missing values need to be filled in, because each value is important for the validity and accuracy of a given analysis.
There are several simple strategies for dealing with missing values, such as filling them with zeros or using the row mean for imputation. These methods produce inaccurate estimates. A number of more sophisticated approaches have therefore been proposed to predict missing values [4], such as K-Nearest Neighbor (KNN) imputation. KNN imputation has two steps. The first step selects the k genes whose expression profiles are most similar to that of the gene with the missing value. In the second step, the missing value in the j-th experiment is estimated by a weighted average of the known values of the k selected genes in that experiment. The disadvantage of this approach is that it depends on the parameter k.
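The two KNN steps can be illustrated with a minimal Python sketch. It is not the implementation evaluated in [4]. The function name `knn_impute`, the use of `None` for missing entries, the root-mean-square distance over shared experiments, and the inverse-distance weights are all assumptions made for this example.

```python
import math


def knn_impute(matrix, gene, exp, k):
    """Estimate matrix[gene][exp] from the k most similar genes.

    Missing entries are represented by None (an assumption of this sketch).
    """
    target = matrix[gene]
    candidates = []
    for g, row in enumerate(matrix):
        if g == gene or row[exp] is None:
            continue
        # Compare only the experiments observed in both profiles.
        shared = [(a, b) for a, b in zip(target, row)
                  if a is not None and b is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
        candidates.append((dist, row[exp]))
    if not candidates:
        raise ValueError("no neighbor has a value in this experiment")
    # Step 1: keep the k genes with the most similar profiles.
    neighbors = sorted(candidates)[:k]
    # Step 2: inverse-distance weighted average of their known values.
    weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)


# Toy matrix: 4 genes x 3 experiments, one missing value in gene 0.
data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
estimate = knn_impute(data, gene=0, exp=2, k=2)
```

With k = 2 the distant fourth gene is ignored. Raising k to 3 pulls the estimate toward its value, which shows the dependence on k noted above.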

RELATED WORK
We use cluster-based algorithms for estimating missing values because they do not require the user to determine parameters [5]. Another limitation of existing estimation methods for microarray data is that they use no external information: the estimation is based solely on the expression data [6]. The Gene Ontology (GO) has become one of the most important and useful resources in bioinformatics. The fuzzy C-means clustering algorithm combined with gene ontology (FCMGOimpute) has been proposed to avoid this limitation [7].
In that method, two genes are similar if they have the same annotations. This similarity measure is not good enough. In [8], the fuzzy C-means clustering algorithm with semantic similarity (FCMSSimpute) uses semantic similarity to measure the similarity between genes.
In this paper, we change the semantic similarity measure on GO to increase the accuracy of imputation: the previous method computes similarity from information content (IC) only, whereas we use both structure and IC information.
The remainder of this paper is organized as follows. The next section describes the FCMGOimpute method. Section 3 describes the proposed methodology and the semantic similarity methods, and Section 4 discusses the results of applying the method to the yeast cell cycle dataset. The paper ends with conclusions and future work.

FCMGOimpute
Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Clustering methods such as K-means or Self-Organizing Maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene on the overall shape of the clusters. Here we apply a fuzzy partitioning method, Fuzzy C-Means (FCM), to attribute cluster membership values to the genes, so that a single gene may belong to several clusters. In fuzzy clustering, each point has a degree of membership in each cluster between 0 and 1 [7].

Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of given examples and let $C$ be the number of clusters. The degree of belonging of data object $x_k$ to cluster $i$ is denoted $u_{ik}$ and satisfies the following constraints:

$$\sum_{k=1}^{n} u_{ik} > 0 \quad \text{for all } i \tag{1}$$

$$\sum_{i=1}^{C} u_{ik} = 1 \quad \text{for all } k \tag{2}$$

Fuzzy C-means clustering is based on minimization of the following objective function:

$$J(U, V) = \sum_{i=1}^{C} \sum_{k=1}^{n} u_{ik}^{m} \, d_{ik}^{2} \tag{3}$$

where $m$ is the fuzziness parameter, a real number greater than 1, and $d_{ik}$ is the Euclidean distance between data object $x_k$ and cluster center $v_i$, defined as in [6] (Eq. 4). In that definition, $s$ is the feature space dimension and $B$ is defined from the gene ontology annotations of gene $k$ and gene $t$ (Eq. 5). The cluster centers and membership degrees that minimize the objective function are calculated by Eq. (6) and Eq. (7).

To determine the fuzziness parameter ($m$) and the number of clusters ($C$), some methods were proposed in [7]. In FCMGOimpute, two genes are similar if they have the same annotation and dissimilar if their annotations differ. Under this definition the similarity value is either 0 or 1, whereas semantic similarity is not crisp: it is a real value between 0 and 1.
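For orientation, the alternating optimization behind fuzzy C-means can be sketched in Python. This sketch uses the textbook update rules with a plain Euclidean distance on complete data. It does not include the GO-based term of FCMGOimpute, and it is not claimed to reproduce Eq. (6) and Eq. (7) of the paper. The function name `fcm` and the toy data are assumptions.

```python
import math


def fcm(points, centers, m=2.0, iterations=50):
    """Textbook fuzzy C-means on complete data (assumed update rules)."""
    c = len(centers)
    dims = len(points[0])
    u = []
    for _ in range(iterations):
        # Membership update: u[i][k] is the degree of point k in cluster i.
        u = [[0.0] * len(points) for _ in range(c)]
        for k, x in enumerate(points):
            # Clamp distances to avoid division by zero at a center.
            d = [max(math.dist(x, v), 1e-12) for v in centers]
            for i in range(c):
                u[i][k] = 1.0 / sum(
                    (d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c)
                )
        # Center update: mean of the points weighted by u[i][k] ** m.
        centers = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(len(points))]
            centers.append([
                sum(wk * x[j] for wk, x in zip(w, points)) / sum(w)
                for j in range(dims)
            ])
    return u, centers


# Two well-separated groups of two points each.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
memberships, final_centers = fcm(points, [[0.0, 0.0], [10.0, 10.0]])
```

Each column of `memberships` sums to 1, as required by constraint (2), and every point keeps a small nonzero degree in the cluster it does not mainly belong to.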

Semantic Similarity Methods
Semantic similarity measures can be used to calculate the similarity of two concepts in an ontology. There are three types of approaches for comparing terms in an ontology such as GO: edge-based, node-based, and hybrids of the two.

1. Edge-based approaches [9] are based mainly on counting the number of edges on the graph path between two terms.

2. Node-based approaches are based mainly on information content (IC). The IC value of a term $t$ is defined as

$$IC(t) = -\log p(t) \tag{8}$$

where $p(t)$ is the probability of occurrence of the term $t$ in a certain corpus. All node-based methods are defined in terms of the IC values of the terms involved. The following methods measure the semantic similarity of two GO terms $t_1$ and $t_2$. Resnik's method [10] is shown in Eq. (9):

$$sim_{Resnik}(t_1, t_2) = IC(t_{MICA}) \tag{9}$$

Lin's method [11] is shown in Eq. (10), and Jiang's method [12] is based on the distance in Eq. (11):

$$sim_{Lin}(t_1, t_2) = \frac{2 \, IC(t_{MICA})}{IC(t_1) + IC(t_2)} \tag{10}$$

$$dis_{Jiang}(t_1, t_2) = IC(t_1) + IC(t_2) - 2 \, IC(t_{MICA}) \tag{11}$$

where $t_{MICA}$ is the most informative common ancestor (MICA) of $t_1$ and $t_2$. In edge-based methods, the weights of the edges conflict with the properties of GO. In node-based methods, only the IC values of the two terms and their MICA are considered, regardless of their position in GO.

3. Hybrid methods consider both the substructure of GO and the properties of the terms involved, unlike the existing edge-based (node-based) methods, which use only structure (IC) information.
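The node-based measures can be made concrete with a small Python sketch on a toy ontology. The term names, the parent links and the probabilities p(t) below are hypothetical and chosen only to keep the arithmetic easy to follow. Real GO terms and annotation frequencies would replace them.

```python
import math

# Hypothetical toy ontology: each term maps to its list of parents.
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "c": ["a"], "d": ["a"]}
# Hypothetical annotation probabilities p(t); the root covers everything.
P = {"root": 1.0, "a": 0.5, "b": 0.5, "c": 0.25, "d": 0.125}


def ic(term):
    """Information content: IC(t) = -log p(t)."""
    return -math.log(P[term])


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica(t1, t2):
    """Most informative common ancestor of two terms."""
    return max(ancestors(t1) & ancestors(t2), key=ic)


def resnik(t1, t2):
    return ic(mica(t1, t2))


def lin(t1, t2):
    denom = ic(t1) + ic(t2)
    # Two root terms carry no information; treat them as identical.
    return 2.0 * ic(mica(t1, t2)) / denom if denom > 0 else 1.0
```

In this toy graph the terms `c` and `d` share the ancestor `a`, while `c` and `b` share only the root. Both measures therefore give zero for the second pair, whatever the positions of the terms in the graph. This is the limitation of node-based methods noted above.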

Proposed Method
FCMSSimpute uses Lin's method to compute semantic similarity. Lin's method is node-based: it relies on the properties of the terms, represented using concepts derived from information theory [8]. We use a new hybrid method, the shortest path (SP) algorithm for measuring semantic similarity on GO, and call the resulting imputation method FCMSPimpute. The SP algorithm uses more information than the single IC values used in node-based algorithms, and the weights it assigns to the substructure are more consistent with common interpretation than those of previous edge-based methods.
The shortest path algorithm can be described as follows [13]. Given two terms $t_1$ and $t_2$, the normalized semantic similarity between them is defined by Eq. (12). The normalizing function in Eq. (12) maps the distance, obtained by summing the weights of the terms on the shortest path, into the range [0, 1]. The SP algorithm assigns each term its own weight, as defined in [13].
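The idea of a node-weighted shortest path can be sketched in Python as follows. The text above does not preserve the term weight or the normalizing function, so this sketch makes two assumptions that should not be read as the definition in [13]: each term t is weighted by 1/IC(t), and the distance is normalized with the arctangent. The toy graph and IC values are hypothetical.

```python
import heapq
import math

# Hypothetical toy ontology as undirected links between terms.
EDGES = [("root", "a"), ("root", "b"), ("a", "c"), ("a", "d")]
# Hypothetical IC values; the root carries no information.
IC = {"root": 0.0, "a": 0.7, "b": 0.7, "c": 1.4, "d": 2.1}


def term_weight(term):
    # Assumption of this sketch: informative terms are cheap to traverse.
    return math.inf if IC[term] == 0 else 1.0 / IC[term]


def neighbors(term):
    for u, v in EDGES:
        if u == term:
            yield v
        elif v == term:
            yield u


def sp_distance(t1, t2):
    """Dijkstra over term weights; both endpoints are counted."""
    best = {t1: term_weight(t1)}
    queue = [(best[t1], t1)]
    while queue:
        dist, term = heapq.heappop(queue)
        if term == t2:
            return dist
        if dist > best[term]:
            continue  # stale queue entry
        for nxt in neighbors(term):
            cand = dist + term_weight(nxt)
            if cand < best.get(nxt, math.inf):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return math.inf


def sp_similarity(t1, t2):
    # Assumption: arctan maps [0, inf] onto [0, pi/2], giving a value in [0, 1].
    return 1.0 - math.atan(sp_distance(t1, t2)) / (math.pi / 2)
```

Under these assumptions a path through the uninformative root has infinite length, so `sp_similarity("c", "b")` is 0, while `c` is closer to its parent `a` than to its sibling `d`. The score therefore depends on the position of the terms in the graph as well as on their IC values.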
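We compute gene semantic similarity as in [14] (Eq. 13), and we modify the calculation of the Euclidean distance as shown in Eq. (14). Cluster centers and membership degrees are calculated as in Eq. (6) and Eq. (7). We impute missing values using the weighted mean of the values of the corresponding attribute over all clusters. The weighting factors are the membership degrees of the gene in the clusters. The missing expression value of gene $k$ in experiment $j$ is imputed by:

$$x_{kj} = \frac{\sum_{i=1}^{C} u_{ik} \, v_{ij}}{\sum_{i=1}^{C} u_{ik}} \tag{15}$$

where $u_{ik}$ is the membership degree of gene $k$ in cluster $i$, $v_{ij}$ is the value of experiment $j$ at the center of cluster $i$, and $C$ is the number of clusters.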
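The imputation step can be sketched in Python as a membership-weighted mean of the cluster centers. The function name `impute_value` and the toy numbers are hypothetical. The sketch assumes that the weights are the plain membership degrees, as the description above states.

```python
def impute_value(memberships, centers, gene, exp):
    """Membership-weighted mean of cluster centers for one missing entry.

    memberships[i][k]: degree of gene k in cluster i.
    centers[i][j]: value of experiment j at the center of cluster i.
    """
    weights = [row[gene] for row in memberships]
    total = sum(weights)
    return sum(w * center[exp] for w, center in zip(weights, centers)) / total


# Two clusters, two genes, two experiments (toy numbers).
u = [[0.8, 0.1],
     [0.2, 0.9]]
v = [[1.0, 2.0],
     [3.0, 6.0]]
estimate = impute_value(u, v, gene=0, exp=1)
```

Gene 0 belongs mostly to the first cluster, so the estimate 0.8 × 2.0 + 0.2 × 6.0 = 2.8 lies closer to that cluster's center value.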

Experimental Results
A time series dataset is used to evaluate the proposed method [15]. We have also conducted experiments on a dataset from the UCI machine learning repository [16]: the abalone dataset, which contains 8 continuous attributes and 4177 instances in total. The gene annotations were retrieved from the Saccharomyces Genome Database (SGD) [17]. In our experiments, we used the annotations from the biological process (BP) ontology. We deleted rows with missing values to obtain a complete dataset as the test dataset. The performance of the missing value estimation is evaluated by the Root Mean Squared Error (RMSE):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{16}$$

where $y_i$ and $\hat{y}_i$ are the real value and the estimated value, respectively, and $n$ is the number of missing values.

We have applied FCMimpute, FCMGOimpute, FCMSSimpute and our proposed method (FCMSPimpute) to yeast cell cycle data with different percentages of missing values and compared their accuracy by means of RMSE. The experimental results are shown in Figures 1 and 2. FCMimpute does not use any external information and relies on the microarray data alone for imputation. As the results show, FCMimpute performs worse than the other methods. FCMGOimpute uses gene ontology annotation as extra knowledge, but its similarity value is either 0 or 1. FCMSSimpute and FCMSPimpute have lower RMSE. The proposed method (FCMSPimpute) uses gene ontology annotation as external information, together with a semantic similarity on GO based on both the GO structure information and the terms' properties. FCMSPimpute performs better than FCMSSimpute because the semantic similarity obtained by the SP algorithm correlates better with expression similarity than that obtained by node-based methods.
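The RMSE criterion used in this evaluation can be computed as in the following Python sketch. The function name `rmse` and the sample values are hypothetical. The inputs are the true values that were removed and the values estimated for the same positions.

```python
import math


def rmse(actual, estimated):
    """Root mean squared error over the artificially removed entries."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    return math.sqrt(squared / len(actual))


# Three removed entries; only the last one is estimated incorrectly.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A lower value means that the imputed entries are closer to the entries that were removed, which is how the methods are ranked above.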

CONCLUSIONS
In this paper, an efficient method for estimating missing values in microarray data, namely FCMSPimpute, is proposed. We take advantage of the correlation structure of the data to estimate missing expression values by clustering, and we use gene semantic similarity, which improves the imputation accuracy.
We have evaluated the performance of our method on two datasets and compared its accuracy with that of FCMimpute, FCMGOimpute and FCMSSimpute. The experimental results show that the proposed method (FCMSPimpute) outperforms the other methods in terms of accuracy. In FCMSPimpute, we have changed the semantic similarity method, and we consider two genes similar if they have the same biological process annotations rather than molecular function annotations.

Figure 1. Comparison of the accuracy of the FCMimpute, FCMGOimpute, FCMSSimpute and FCMSPimpute methods by RMSE for the time series dataset over 1 to 20% missing data.