Independent Component Analysis (ICA) Based-Clustering of temporal RNA-seq Data

Moysés Nascimento; Fabyano Fonseca e Silva; Thelma Sáfadi; Ana Carolina Campana Nascimento; Talles Eduardo Maciel Ferreira; Laís Mayara Azevedo Barroso; Camila Ferreira Azevedo; Simone Eliza Faccione Guimarães; Nick Vergara Lopes Serão

doi:10.5281/zenodo.571134

Published May 3, 2017 | Version v1

Journal article Open

1. Department of Statistics, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil
2. Department of Animal Science, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil
3. Department of Exact Sciences, Federal University of Lavras, Lavras, Minas Gerais, Brazil
4. Department of Animal Science, Iowa State University, Ames, Iowa, USA

Gene expression time series (GETS) analysis aims to characterize sets of genes according to their longitudinal patterns of expression. Because GETS analysis evaluates a large number of genes, a useful strategy for summarizing biological functional processes and regulatory mechanisms is to cluster genes that present similar expression patterns over time. Traditional clustering methods usually ignore the challenges of GETS data, such as the lack of normality and the small number of temporal observations. Independent Component Analysis (ICA) is a statistical procedure that transforms raw time series data into sets of values of independent variables, which can then be used in cluster analysis to identify sets of genes with similar temporal expression patterns. ICA allows clustering of short series of distribution-free data while accounting for the dependence between subsequent time points. Using simulated and real (four libraries of two pig breeds at 21, 40, 70 and 90 days of gestation) temporal RNA-seq data sets, we present a methodology (ICAclust) that combines ICA with a hierarchical method for clustering GETS. We compare ICAclust results with those obtained with K-means clustering. ICAclust presented, on average, an absolute gain of 5.15% over the best K-means scenario. Relative to the worst K-means scenario, the gain of the best ICAclust result was 84.85%. For the real data set, genes were grouped into six distinct clusters with 89, 51, 153, 67, 40, and 58 genes, respectively. In general, the six clusters presented very distinct expression patterns. Overall, the proposed two-step clustering method (ICAclust) performed well compared to K-means, a traditional method used for cluster analysis of temporal gene expression data. In ICAclust, genes with similar expression patterns over time were clustered together.
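
The two-step idea described above (an ICA transformation followed by a hierarchical clustering of the transformed values) can be illustrated with a minimal Python sketch. This is not the authors' R implementation (ICAclust). It assumes scikit-learn's FastICA as the ICA estimator, average linkage with Euclidean distance as the hierarchical method, and a small simulated genes-by-time-points matrix. The function name `ica_then_hierarchical`, the toy patterns, the noise level, and the choice of two components are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import FastICA


def ica_then_hierarchical(expr, n_components, n_clusters, seed=0):
    """Cluster the rows (genes) of a genes x time-points matrix.

    Assumption: each gene is treated as one sample and each time point
    as one feature; the abstract does not specify this orientation.
    """
    # Step 1: ICA gives every gene a short vector of component scores.
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
    scores = ica.fit_transform(expr)  # shape: (n_genes, n_components)

    # Step 2: agglomerative clustering of the genes in component space.
    # Average linkage is an assumption; the source only says "hierarchical".
    tree = linkage(scores, method="average", metric="euclidean")

    # Cut the tree so that at most n_clusters groups remain (labels start at 1).
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Hypothetical toy data: three temporal patterns at four time points,
# using the gestation days mentioned in the abstract as the time axis.
rng = np.random.default_rng(42)
days = np.array([21.0, 40.0, 70.0, 90.0])
t = (days - days.min()) / (days.max() - days.min())
patterns = np.vstack([t, 1.0 - t, 4.0 * t * (1.0 - t)])  # rising, falling, peaked
true_group = np.repeat(np.arange(3), 20)                  # 20 genes per pattern
expr = patterns[true_group] + rng.normal(scale=0.05, size=(60, 4))

labels = ica_then_hierarchical(expr, n_components=2, n_clusters=3)
print(np.bincount(labels)[1:])  # number of genes assigned to each cluster
```

Clustering the component scores rather than the raw series is the design choice the abstract motivates: the ICA step does not rely on normality and works with very short series, so the subsequent distance computation operates on a compact representation of each gene's temporal profile.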

All data sets related to the simulation (replicate1.docx, replicate2.docx, ..., replicate10.docx) and the real data (RNA_seq_Pig.docx), as well as the R code (ICAclust.docx), are available.

Files (1.7 MB)

Name	MD5	Size
icaclust-r.docx	e68056114203c0c2b7ed2c9eb0634e6b	92.0 kB
replicate1-txt.docx	7614c09172aaf893aa53b27fe89c1157	145.5 kB
replicate2-txt.docx	b2ab6b6d00df480e5cf3bb90aeb0b8af	145.6 kB
replicate3-txt.docx	ce919f42382a49321f2cb12b50574ac2	145.0 kB
replicate4-txt.docx	8b69ce785b78deafc3f2fae315269a71	146.3 kB
replicate5-txt.docx	24c41894312903f54013f3740f3be58d	145.0 kB
replicate6-txt.docx	838a9bf8eab37df9c5abcdaf8d298ecf	146.9 kB
replicate7-txt.docx	2e644940927bbb305efe84570533efe2	145.2 kB
replicate8-txt.docx	7be73a7380569e72e2a73cafca2c0051	145.7 kB
replicate9-txt.docx	a7303ab06582d7b1a043aae6fe7de575	145.1 kB
replicate10-txt.docx	318c9713540a38ae3eb7d7e57658ec2e	147.1 kB
rna_seq_pig-txt.docx	19b5967bfb16796906ba5bdf72f2523e	140.7 kB

