The human genome's vocabulary as proposed by the DNA language model GROVER - the code to the paper
Description
Code accompanying the preprint https://www.biorxiv.org/content/10.1101/2023.07.19.549677v1.
Python was used for the model, performance assessment and data generation. R was used for scripting and data visualisation. All input data for the R scripts are provided separately, so the data-intensive and computationally intensive steps do not have to be repeated.
For the Python code, the folder finetuning_tasks was split into four folders because of upload problems. After decompression, the four parts have to be merged back into a single finetuning_tasks folder.
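The merge step can be sketched in Python as follows. This is a minimal illustration, not part of the deposited code: the part folder names (`finetuning_tasks_part1` to `finetuning_tasks_part4`) are hypothetical placeholders, since the actual names of the four decompressed folders are not stated here, and the demonstration runs on throwaway directories.

```python
import shutil
import tempfile
from pathlib import Path


def merge_parts(part_dirs, target):
    """Copy the contents of every part folder into one target folder."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    for part in part_dirs:
        # dirs_exist_ok=True (Python 3.8+) lets a later part add files to
        # subfolders that an earlier part already created.
        shutil.copytree(part, target, dirs_exist_ok=True)
    return target


# Demonstration on temporary folders; the part names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    parts = []
    for i in range(1, 5):
        part = root / f"finetuning_tasks_part{i}"
        (part / "task").mkdir(parents=True)
        (part / "task" / f"file{i}.txt").write_text(f"part {i}\n")
        parts.append(part)

    merged = merge_parts(parts, root / "finetuning_tasks")
    merged_files = sorted(p.name for p in merged.rglob("*.txt"))
    print(merged_files)
```

To use it on the real download, pass the paths of the four decompressed folders to `merge_parts`. If two parts contain a file at the same relative path, the later copy overwrites the earlier one, so it is worth checking for such collisions first.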
A tutorial on how to use GROVER as a foundation model can be found at: https://doi.org/10.5281/zenodo.8373159
The pretrained model can be found at: https://doi.org/10.5281/zenodo.8373117
The data for the tokenised genome are at: https://doi.org/10.5281/zenodo.8373053
Files
Total size: 26.7 GB

MD5 checksum | Size |
---|---|
2cf08e20a807f1c5a9d93f2885709b4a | 130.8 MB |
f183b4f382a9b1af3a3fb9f291e9261d | 146.8 MB |
9298ba6caea3430266fccc7c3c81ae39 | 2.9 GB |
0bbb6b3b44babd90f378425aee7de018 | 6.8 GB |
3030588b853e4d26447d7419624fb95b | 4.3 GB |
47a85174306ab13fbd193d53d1eb083b | 3.9 GB |
c8d3f84a410f46d4d8b03a268e313e35 | 205 Bytes |
534f27fd43b72e4f62321c1bcd21edd4 | 6.0 MB |
03ac98956497b1fb9b3feb42f9b13b24 | 99.8 kB |
821cf6cf628cefff6e3ebe3fc9d40bba | 14.7 MB |
40d5b3cc63ea1cff9c1fcf4fc6a975ae | 46.3 kB |
e7c324b82ce7acc347d5c179222e7ea2 | 2.9 GB |
b0ae2aaf99429b6502b2498839374512 | 4.4 GB |
7cf9026e9e47291c6825c2ae9c026436 | 751 Bytes |
ac202858cab4e478bae25ea4050ab800 | 4.3 kB |
9f8cb6eb0b22d2befcdb1b9fa8e8b8ab | 1.3 GB |
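The md5 values listed for each file can be used to confirm that a large download arrived intact. The sketch below shows one way to do that with Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are illustrative, not part of the deposited code, and the demonstration hashes a small temporary file rather than one of the archives.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use flat even for multi-GB archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Demonstration on a small temporary file (stand-in for a downloaded archive).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    payload = b"ACGT" * 1000
    demo_path.write_bytes(payload)

    good = matches_checksum(demo_path, "md5:" + hashlib.md5(payload).hexdigest())
    bad = matches_checksum(demo_path, "md5:" + "0" * 32)
    print(good, bad)
```

For a real file, call `matches_checksum` with the path of the downloaded file and the md5 string shown next to it in the listing.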
Additional details
Related works
- Is cited by
- Preprint: https://www.biorxiv.org/content/10.1101/2023.07.19.549677v1 (URL)
- Is supplement to
- Software documentation: 10.5281/zenodo.8373159 (DOI)
- Requires
- Software: 10.5281/zenodo.8373117 (DOI)
- Dataset: 10.5281/zenodo.8373053 (DOI)