GenomicLLM

Huaqing, Liu; Shuxian, Zhou; Peiyi, Chen; Jiahui, Liu; Ku-Geng, Huo; Lanqing, Han

doi:10.5281/zenodo.10695802

Published February 2024 | Version v2

Dataset Open

Motivation: With the rapid development of genomic sequencing technologies and the accumulation of sequencing data, there is an increasing demand for analysis tools that are accessible to users without programming experience. To meet this demand, we developed an all-in-one tool called GenomicLLM that can understand simple grammar in the input question and perform different types of analyses and tasks accordingly.

Results: We trained the GenomicLLM model using three large open-access datasets, namely GenomicLLM_GRCh38, Genome Understanding Evaluation and GenomicBenchmarks, and developed a hybrid tokenization approach to allow better comprehension of mixed corpora that include sequence and non-sequence inputs. GenomicLLM can carry out a wider range of tasks than existing tools. In the classification tasks that are also supported by the state-of-the-art DNABERT-2 and HyenaDNA, GenomicLLM has comparable performance. Moreover, GenomicLLM can carry out regression and generation tasks that these tools cannot perform. In summary, we demonstrate a successful large language model trained on a mixture of gene sequences and natural-language text, which enables a wider range of applications.

Files (2.3 GB)

Name	Size	MD5
data.zip	1.3 GB	4a145605ac4bc7e2008a84197611b116
model.zip	1.0 GB	33331cc64c7b1cdabe068641dcc5c841
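
The MD5 checksums listed for the two archives can be used to confirm that a download is intact. The following is a minimal sketch in Python, assuming both files were saved under their original names in the current directory. The helper names md5_of_file and verify_downloads are illustrative only and are not part of the GenomicLLM repository.

```python
import hashlib
from pathlib import Path

# Checksums copied from the Files listing of this record.
EXPECTED_MD5 = {
    "data.zip": "4a145605ac4bc7e2008a84197611b116",
    "model.zip": "33331cc64c7b1cdabe068641dcc5c841",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not found in the given directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = (md5_of_file(path) == expected)
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, outcome in verify_downloads().items():
        print(f"{name}: {labels[outcome]}")
```

Reading in chunks keeps memory use low for archives of this size (over 1 GB each). A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.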

Additional details

Other: /

Is previous version of: Preprint: / (Other)

Submitted: 2024-02-23

Repository URL: https://github.com/Huatsing-Lau/GenomicLLM
Programming language: Python
Development Status: Active

	All versions	This version
Views	341	319
Downloads	144	142
Data volume	218.1 GB	216.1 GB
