There is a newer version of the record available.

Published September 23, 2023 | Version v1
Software Open

The human genome's vocabulary as proposed by the DNA language model GROVER - the code to the paper

  • 1. TU Dresden

Description

This record contains the code accompanying the preprint https://www.biorxiv.org/content/10.1101/2023.07.19.549677v1.

Python was used for the model, performance assessment, and data generation. R was used for scripting and data visualisation. All input data for the R scripts are provided separately, so the data-intensive and computationally intensive steps do not have to be repeated.
For the Python code, the folder finetuning_tasks has to be reassembled after decompression: it was split into four folders because of upload problems, and these have to be merged back into one.
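
The merge step can be sketched in Python as below. This is a minimal sketch rather than the authors' procedure: the record does not name the four parts, so the directory names in the usage comment are hypothetical, and the helper `merge_parts` is introduced here for illustration only.

```python
import shutil
from pathlib import Path


def merge_parts(part_dirs, target_dir):
    """Copy the contents of every part directory into one target directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for part in part_dirs:
        # dirs_exist_ok=True (Python 3.8+) lets later parts add files to
        # subfolders that an earlier part already created.
        shutil.copytree(part, target, dirs_exist_ok=True)
    return target


# Hypothetical usage; replace the part names with the folders you actually
# obtain after decompressing the archives:
# merge_parts(
#     ["finetuning_tasks_1", "finetuning_tasks_2",
#      "finetuning_tasks_3", "finetuning_tasks_4"],
#     "finetuning_tasks",
# )
```

If two parts contain a file with the same relative path, the copy from the later part overwrites the earlier one, so it is worth checking for such collisions before merging.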

A tutorial on how to use GROVER as a foundation model can be found at: https://doi.org/10.5281/zenodo.8373159

The pretrained model can be found at: https://doi.org/10.5281/zenodo.8373117

The data for the tokenised genome are at: https://doi.org/10.5281/zenodo.8373053

Files

chr21.zip

Files (26.7 GB)

Checksum  Size
md5:2cf08e20a807f1c5a9d93f2885709b4a  130.8 MB
md5:f183b4f382a9b1af3a3fb9f291e9261d  146.8 MB
md5:9298ba6caea3430266fccc7c3c81ae39  2.9 GB
md5:0bbb6b3b44babd90f378425aee7de018  6.8 GB
md5:3030588b853e4d26447d7419624fb95b  4.3 GB
md5:47a85174306ab13fbd193d53d1eb083b  3.9 GB
md5:c8d3f84a410f46d4d8b03a268e313e35  205 Bytes
md5:534f27fd43b72e4f62321c1bcd21edd4  6.0 MB
md5:03ac98956497b1fb9b3feb42f9b13b24  99.8 kB
md5:821cf6cf628cefff6e3ebe3fc9d40bba  14.7 MB
md5:40d5b3cc63ea1cff9c1fcf4fc6a975ae  46.3 kB
md5:e7c324b82ce7acc347d5c179222e7ea2  2.9 GB
md5:b0ae2aaf99429b6502b2498839374512  4.4 GB
md5:7cf9026e9e47291c6825c2ae9c026436  751 Bytes
md5:ac202858cab4e478bae25ea4050ab800  4.3 kB
md5:9f8cb6eb0b22d2befcdb1b9fa8e8b8ab  1.3 GB
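
Because several of the archives are multiple gigabytes, it can be worth comparing a downloaded file against its listed MD5 checksum before decompressing it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_checksum` and the file path in the usage comment are assumptions made for illustration, not part of the published code.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum, with or without the 'md5:' prefix."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Hypothetical usage; pair the downloaded file with its own listed checksum:
# matches_checksum("path/to/downloaded_archive.zip", "md5:<checksum from the list>")
```

A mismatch usually points to an interrupted or corrupted download, in which case the file should be fetched again.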

Additional details

Related works

Is cited by
Preprint: https://www.biorxiv.org/content/10.1101/2023.07.19.549677v1 (URL)
Is supplement to
Software documentation: 10.5281/zenodo.8373159 (DOI)
Requires
Software: 10.5281/zenodo.8373117 (DOI)
Dataset: 10.5281/zenodo.8373053 (DOI)