Supplemental datasets for: 'Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras'
Authors/Creators
Description
This repository contains supplementary datasets for 'Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras'. This version also contains the original vector graphics for the figures presented in the main text.
Data S1 | Neural network training sequences (4 GB). Dataset comprising DNA sequences processed in two formats: with terminal information (TI-inclusive) and without terminal information (TI-free).
Data S2 | Training and inference scripts. The LA4SR framework integrated several open-source software packages and models. We employed LoRA (Low-Rank Adaptation; adapted from https://github.com/microsoft/LoRA), PEFT (Parameter-Efficient Fine-Tuning; https://github.com/huggingface/peft), and QLoRA (Quantized Low-Rank Adaptation; adapted from https://github.com/artidoro/qlora) for parameter-efficient post-training, and used Mamba (https://github.com/state-spaces/mamba) as an alternative to transformer-based architectures. The Hugging Face Transformers library (https://github.com/huggingface/transformers) facilitated implementation, pretraining, and post-training of the open-source models. Training was performed on an HPC cluster, with jobs dispatched to nodes equipped with NVIDIA (Santa Clara, CA, USA) V100, A100, or H100 GPUs. Dataset available at DOI: 10.5281/zenodo.13920001.
Data S3 | Interpretability scripts, including DeepMotifMinerPro. Includes scripts implementing the custom explainer programs presented with this work, including Captum-, DeepLift-, and SHAP-based approaches, which explain how amino acid residues, their patterns, and their positions affect model decisions.
Data S4 | Additional validations using new assemblies from seen species. We cultured and sequenced ten separate isogenic colonies of Chlamydomonas reinhardtii CC-1883. Of these, nine were sequenced with Illumina 150 bp paired-end short reads and one with Pacific Biosciences (PacBio, Menlo Park, CA, USA) HiFi reads and DoveTail (Sydney, Australia) Hi-C to generate a complete, axenic reference assembly. These data were also uploaded to NCBI (SAMN44618602).
Data S5 | Singularity container to run LA4SR. Includes an environment supporting LA4SR.
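The low-rank adaptation named in the Data S2 description can be illustrated with a minimal, dependency-free sketch. It is not taken from the LA4SR scripts: the function names (`matvec`, `lora_forward`, `trainable_params`) and the toy dimensions are assumptions chosen for illustration. It shows only the general LoRA idea, in which a frozen weight matrix W is augmented by a trainable low-rank product scaled by alpha / r.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would be trained. All names here are illustrative assumptions.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Number of trainable values in the low-rank pair (A, B)."""
    return r * (d_in + d_out)


# Toy dimensions (hypothetical, for illustration only).
d_in, d_out, r, alpha = 8, 6, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero, the adapted layer reproduces the frozen layer.
assert lora_forward(W, A, B, x, alpha) == matvec(W, x)

# Parameter comparison for a 4096 x 4096 layer at rank 8.
print(4096 * 4096)                      # full matrix: 16777216
print(trainable_params(4096, 4096, 8))  # low-rank pair: 65536
```

The parameter comparison at the end is the reason such methods are called parameter-efficient: only the two small factors are updated during post-training.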
Files (16.2 GB)
Previewed file: FIGURE_1-25MAY2025.pdf
| MD5 checksum | Size |
|---|---|
| c6adc89e86e8d0da44e1be1a9cc9e3e5 | 5.9 GB |
| 50062123a7f64e1b945203a44e608f91 | 17.5 kB |
| 9d94f9308dfca8957d2c1077a50e9a58 | 29.5 kB |
| 904077cfec9c910904330b7cffb0a6df | 435.1 MB |
| 731b802117bfad1286a40eb5d19626b6 | 9.8 GB |
| 8bed5f7faa910cf89dc5c6d5b75f4566 | 8.9 MB |
| a8e191fc1683d43059c10ae09f20ffc5 | 1.6 MB |
| c14c1ba3cbfaa6a35d52719186d8eaf7 | 15.3 MB |
| fee144d075e7e03a3ee260af993e0aaa | 2.6 MB |
| 3614d18369de8d834234a4306147918d | 731.5 kB |
| 8ecb4867b03e63a6145b6c4d325d830f | 684.0 kB |
| aabc244df4fd588e640f424969de9256 | 7.5 MB |
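Because several of these files are multiple gigabytes, it can be worth checking a download against its listed MD5 checksum. The following is a minimal sketch using only the Python standard library. The helper names (`md5_of_file`, `verify_download`) and the file name in the usage comment are hypothetical and are not part of this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks
    so that multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Compare a file's MD5 digest with a listed checksum.

    Accepts the checksum with or without a leading 'md5:' prefix.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[4:]
    return md5_of_file(path) == expected


# Usage (the file name is a hypothetical placeholder):
# ok = verify_download("downloaded_file.zip",
#                      "md5:50062123a7f64e1b945203a44e608f91")
```

A result of `False` usually indicates an incomplete or corrupted transfer, in which case the file should be downloaded again.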
Additional details
Software
- Repository URL: https://huggingface.co/GreenGenomicsLab