Tissue Classification Using RNA Splice Junctions with Distribution Analysis and Machine Learning Models
Authors/Creators
- 1. School of Data Science and Psychology Department, University of Virginia, Charlottesville, USA
Description
The human body generates more proteins than it has protein-coding genes. This diversity stems from the alternative ways in which RNA can be spliced and reassembled. Each alternative version of an RNA transcript produces a different protein, allowing the body to produce a wide range of proteins from a single gene. Some alternative RNA transcripts, however, contain splicing errors and produce faulty proteins involved in genetic diseases. Understanding splicing patterns and profiles therefore has wide implications for our understanding of healthy and diseased tissue states. Currently, little is known about the splicing profiles of healthy tissue, which vary across individuals and, within an individual, by tissue type. This project explored the use of RNA splicing data from the first chromosome to predict the tissue type of non-cancerous samples using distribution analysis and supervised learning methods. The Kolmogorov-Smirnov test, used to classify samples based on their empirical cumulative distribution functions, was not able to reliably distinguish between tissue types. Support vector machine (SVM) models, however, achieved high classification accuracy, even when using different splice junction representations. Overall, the findings suggest the utility of splice junction data in biological classification and set the foundation for future work mapping splicing patterns to phenotype.
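The distribution-analysis idea mentioned above can be illustrated with a minimal sketch. The abstract does not describe the exact procedure, so everything here is an assumption for illustration: the helper names `ks_statistic` and `classify_by_ks`, the nearest-reference decision rule, and the toy values are hypothetical and not taken from the paper. The sketch computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical cumulative distribution functions) and assigns a sample to the tissue whose reference values are closest.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the ECDF gap can only change at observed values
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(ecdf_a - ecdf_b))
    return gap


def classify_by_ks(sample, references):
    """Hypothetical rule: pick the label with the smallest KS distance."""
    return min(references, key=lambda label: ks_statistic(sample, references[label]))


# Made-up per-junction values (for example, normalized read counts).
references = {
    "tissue_A": [0.1, 0.2, 0.2, 0.3, 0.4, 0.5],
    "tissue_B": [0.6, 0.7, 0.8, 0.8, 0.9, 1.0],
}
new_sample = [0.15, 0.25, 0.35, 0.45]
predicted = classify_by_ks(new_sample, references)
print(predicted)  # tissue_A for this toy data
```

In this toy case the two reference distributions barely overlap, so the rule separates them easily. The abstract reports that on real splicing data the Kolmogorov-Smirnov approach did not reliably distinguish tissue types, which is consistent with real tissues having heavily overlapping distributions.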
Files
| Name | Size |
|---|---|
| DeCanio_Tissue_Classification.pdf (md5:ca00e0d977404339b6fce0c6793bd94f) | 448.2 kB |