Feasibility of predicting allele-specific expression from genetic variants using machine learning
Description
Background:
Allele-specific expression (ASE) refers to divergent abundances of the allelic copies of a transcript and is quantified by RNA sequencing. Multiple studies have shown that ASE likely plays a role in many diseases by modulating pathogenesis or phenotype. However, routine genome diagnostics is based on DNA sequencing and therefore neglects the regulation of gene expression, including ASE. To take advantage of ASE information in the absence of RNA sequencing, ASE must be predicted using only DNA variation.
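To make the notion of "divergent abundances of allelic copies" concrete, the following minimal sketch computes an allelic ratio from RNA-seq read counts at a heterozygous site and tests it against balanced expression. The exact binomial test, the null ratio of 0.5, and the function names are illustrative assumptions; the text above does not state which statistical procedure was used to call ASE.

```python
from math import comb


def allelic_ratio(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads carrying the reference allele (hypothetical helper)."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a null allelic ratio.

    Assumption: balanced expression corresponds to p_null = 0.5.
    Intended for modest read depths; very deep sites would need a
    log-space or library implementation.
    """
    n = ref_reads + alt_reads
    # Probability of every possible reference-read count under the null.
    pmf = [comb(n, k) * p_null**k * (1 - p_null) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the outcomes that are no more likely than the observed one.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: 30 reference reads and 10 alternative reads.
    print(allelic_ratio(30, 10))
    print(binomial_two_sided_p(30, 10))
```

A site with a small p-value and a ratio far from 0.5 would be a candidate ASE site under these assumptions; in practice, mapping bias and overdispersion usually call for a more careful model than a plain binomial.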
Results:
We constructed ASE models from BIOS and GTEx data that predict local ASE using DNA features. These models are highly stable, as shown by performance metrics, cross-validation, and feature importance. Many different types of features are involved in these predictions, highlighting the complex regulation that underlies ASE. We applied the BIOS-trained model to population variants in three genes in which ASE plays a role in disease penetrance or modulation: BRCA2, RET, and NF1. This resulted in predicted ASE effects for 27 out of 8,957 variants. Ten of these are known pathogenic variants, while the remaining 17 are interesting candidates for further elucidation of disease etiology.
Conclusions:
We demonstrated that ASE can be stably and reliably predicted from DNA features using machine-learning models. Future efforts may further improve sensitivity and translate these models into a new class of in silico prediction tools for genome diagnostics that prioritize candidate pathogenic variants, or regulators thereof, for follow-up validation by RNA sequencing.