Published February 28, 2024 | Version v1
Dataset Open

Gene-language models are whole genome representation learners

  • 1. Oklahoma State University

Description

The language of the genetic code embodies a complex grammar and rich syntax of interacting molecular elements. Recent advances in self-supervision and feature learning suggest that statistical learning techniques can identify high-quality quantitative representations from this inherent semantic structure. We present a gene-based language model that generates whole-genome vector representations for a population of 16 disease-causing bacterial species by leveraging natural contrastive characteristics between individuals. To achieve this, we developed a set-based learning objective, AB learning, that compares the annotated gene content of two population subsets during optimization. Using this foundational objective, we trained a Transformer model to backpropagate information into dense genome vector representations. The resulting bacterial representations, or embeddings, captured important population-structure characteristics, such as delineations across serotypes and host-specificity preferences. Their vector quantities encoded the functional information needed to achieve state-of-the-art supervised prediction accuracy for 11 of 12 antibiotic resistance phenotypes.
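The AB learning objective is only summarized above. As an illustrative sketch, and not the authors' implementation, the set-based comparison can be pictured with genomes encoded as binary gene presence/absence vectors (an assumed encoding): the aggregate gene content of two population subsets is summarized and compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy population: 8 genomes x 20 annotated genes,
# encoded as binary presence/absence vectors (an assumption; the
# dataset's actual encoding is not described in this record).
population = rng.integers(0, 2, size=(8, 20)).astype(float)

def subset_profile(genomes: np.ndarray) -> np.ndarray:
    """Mean gene-content vector summarizing a population subset."""
    return genomes.mean(axis=0)

# Split the population into two subsets A and B and compare their
# aggregate gene content with cosine similarity, a stand-in for
# the set-level comparison the AB objective optimizes over.
a, b = population[:4], population[4:]
pa, pb = subset_profile(a), subset_profile(b)
cos = float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))
print(round(cos, 3))
```

In the actual model, such a comparison would drive gradients back into learnable per-genome embeddings rather than fixed presence/absence vectors.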

Notes

Funding provided by: National Science Foundation
Crossref Funder Registry ID: https://ror.org/021nxhr62
Award Number: 1826820

Files

narms_2017_and_2022_genespace.dir.zarr.zip
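The `.dir.zarr.zip` extension suggests a zipped Zarr directory store. A minimal standard-library sketch for inspecting such an archive's members (demonstrated on a tiny synthetic stand-in, since the real 47.8 MB file is not bundled here):

```python
import io
import zipfile

def list_zip_members(data: bytes) -> list[str]:
    """Return the member paths stored in a zip archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()

# Build a tiny stand-in archive; the member names below are
# hypothetical, mimicking a Zarr v2 directory-store layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(".zgroup", '{"zarr_format": 2}')
    zf.writestr("embeddings/.zarray", "{}")

members = list_zip_members(buf.getvalue())
print(members)  # ['.zgroup', 'embeddings/.zarray']
```

For real use, the `zarr` package can open such archives directly (e.g. via its zip-store support) instead of unpacking them by hand.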

Files (52.3 MB)

md5:bfa993d72399cede27deb28ba8bb41f1 (47.8 MB)
md5:86c740c9aca07dda5534e34bf63a90fb (1.3 MB)
md5:520804061703929a3a15c545c901b729 (2.5 MB)
md5:fa2a1379684c31de9fdd608b0e2cee9e (652.6 kB)
md5:85a7654691cc49e5ad41a3ce7158543b (12.8 kB)