Published February 28, 2024 | Version v1
Dataset Open

Gene-language models are whole genome representation learners

  • 1. Oklahoma State University

Description

The language of the genetic code embodies a complex grammar and rich syntax of interacting molecular elements. Recent advances in self-supervision and feature learning suggest that statistical learning techniques can identify high-quality quantitative representations from this inherent semantic structure. We present a gene-based language model that generates whole-genome vector representations for a population of 16 disease-causing bacterial species by leveraging natural contrastive characteristics between individuals. To achieve this, we developed a set-based learning objective, AB learning, that compares the annotated gene content of two population subsets during optimization. Using this foundational objective, we trained a Transformer model to backpropagate information into dense genome vector representations. The resulting bacterial representations, or embeddings, captured important population-structure characteristics, such as delineations across serotypes and host-specificity preferences. Their vector quantities encoded the functional information needed to achieve state-of-the-art supervised prediction accuracy for 11 of 12 antibiotic resistance phenotypes.
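The AB learning objective is only summarized above. As an illustrative sketch, and not the authors' implementation, the set-based comparison can be pictured with genomes encoded as binary gene presence/absence vectors (an assumed encoding): the aggregate gene content of two population subsets is summarized and compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy population: 8 genomes x 20 annotated genes,
# encoded as binary presence/absence vectors (an assumption; the
# dataset's actual encoding is not described in this record).
population = rng.integers(0, 2, size=(8, 20)).astype(float)

def subset_profile(genomes: np.ndarray) -> np.ndarray:
    """Mean gene-content vector summarizing a population subset."""
    return genomes.mean(axis=0)

# Split the population into two subsets A and B and compare their
# aggregate gene content with cosine similarity, a stand-in for
# the set-level comparison the AB objective optimizes over.
a, b = population[:4], population[4:]
pa, pb = subset_profile(a), subset_profile(b)
cos = float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))
print(round(cos, 3))
```

In the actual model, such a comparison would drive gradients back into learnable per-genome embeddings rather than fixed presence/absence vectors.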

Notes

Funding provided by: National Science Foundation
Crossref Funder Registry ID: https://ror.org/021nxhr62
Award Number: 1826820

Files

narms_2017_and_2022_genespace.dir.zarr.zip
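The `.dir.zarr.zip` extension suggests a zipped Zarr directory store. A minimal standard-library sketch for inspecting such an archive's members (demonstrated on a tiny synthetic stand-in, since the real 47.8 MB file is not bundled here):

```python
import io
import zipfile

def list_zip_members(data: bytes) -> list[str]:
    """Return the member paths stored in a zip archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()

# Build a tiny stand-in archive; the member names below are
# hypothetical, mimicking a Zarr v2 directory-store layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(".zgroup", '{"zarr_format": 2}')
    zf.writestr("embeddings/.zarray", "{}")

members = list_zip_members(buf.getvalue())
print(members)  # ['.zgroup', 'embeddings/.zarray']
```

For real use, the `zarr` package can open such archives directly (e.g. via its zip-store support) instead of unpacking them by hand.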

Files (52.3 MB)

md5:bfa993d72399cede27deb28ba8bb41f1 (47.8 MB)
md5:86c740c9aca07dda5534e34bf63a90fb (1.3 MB)
md5:520804061703929a3a15c545c901b729 (2.5 MB)
md5:fa2a1379684c31de9fdd608b0e2cee9e (652.6 kB)
md5:85a7654691cc49e5ad41a3ce7158543b (12.8 kB)