Journal article Open Access
Shohei Maruyama; Yasuo Matsuyama; Sachiyo Aburatani
Developing methods to estimate gene function is an important task in
bioinformatics. One approach to such annotation is to identify the
metabolic pathway in which a gene is involved. Because gene expression
data reflect various intracellular phenomena, they are considered to
be related to gene function. However, estimating gene function with
high accuracy has been difficult. This low accuracy is thought to stem
from the difficulty of measuring gene expression accurately: even when
measured under the same conditions, expression values usually vary
between measurements. In this study, we proposed a
feature extraction method focusing on the variability of gene
expressions to estimate the genes' metabolic pathway accurately. First,
we estimated the distribution of each gene expression from replicate
data. Next, we calculated the similarity between all gene pairs using
the Kullback-Leibler (KL) divergence, a measure of how much one
distribution differs from another. Finally, we used the resulting
similarity vectors as feature
vectors and trained the multiclass SVM for identifying the genes'
metabolic pathway. To evaluate the method, we applied it to budding
yeast and trained a multiclass SVM to identify seven metabolic
pathways. The accuracy obtained with our method was higher than that
obtained from the raw gene expression data. Thus, our method, which
is based on KL divergence, is useful for identifying the
genes' metabolic pathway.
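
The feature extraction steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which distribution family was fitted or whether the divergence was symmetrized, so the sketch assumes a univariate Gaussian per gene and sums the KL divergence in both directions. The function names, the variance floor, and the toy data are all hypothetical.

```python
import math
from statistics import mean, variance


def fit_gaussian(replicates, min_var=1e-6):
    """Estimate (mean, variance) of one gene's expression from replicates.

    Assumption: expression is modeled as a univariate Gaussian. The variance
    floor min_var is a hypothetical guard against zero variance.
    """
    mu = mean(replicates)
    var = max(variance(replicates), min_var)  # sample variance, floored
    return mu, var


def kl_gaussian(p, q):
    """KL(p || q) for univariate Gaussians given as (mean, variance) pairs."""
    mu_p, var_p = p
    mu_q, var_q = q
    return 0.5 * (
        math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


def kl_feature_vectors(expression):
    """Map {gene: [replicate values]} to {gene: [divergence to every gene]}.

    Assumption: the divergence is symmetrized as KL(p||q) + KL(q||p).
    Entries follow the sorted gene order.
    """
    genes = sorted(expression)
    fitted = {g: fit_gaussian(expression[g]) for g in genes}
    return {
        g: [
            kl_gaussian(fitted[g], fitted[h]) + kl_gaussian(fitted[h], fitted[g])
            for h in genes
        ]
        for g in genes
    }


# Toy replicate data (made up for illustration only).
toy = {
    "geneA": [5.0, 5.2, 4.8],
    "geneB": [5.1, 5.3, 4.9],
    "geneC": [9.0, 9.5, 8.5],
}
features = kl_feature_vectors(toy)
```

Each gene's row in `features` could then serve as the input vector for a multiclass SVM, with the known metabolic pathway of each training gene as its label. A smaller value means the two fitted distributions are more alike, so here `geneA` lies closer to `geneB` than to `geneC`.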