Published May 2025 | Version v5
Dataset Open

Supplemental datasets for: 'Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras'

Authors/Creators

Contributors

Data manager:

  • 1. New York University Abu Dhabi

Description

This repository contains supplementary datasets for 'Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras'. This version also contains the original vector graphics for the figures presented in the main text.

 

Data S1 | Neural network training sequences (4 GB). DNA sequences provided in two formats: terminal information-inclusive (TI-inclusive) and terminal information-free (TI-free).


Data S2 | Training and inference scripts. The LA4SR framework integrated several open-source software packages and models. We employed LoRA (Low-Rank Adaptation; adapted from https://github.com/microsoft/LoRA), PEFT (Parameter-Efficient Fine-Tuning; https://github.com/huggingface/peft), and QLoRA (Quantized Low-Rank Adaptation; adapted from https://github.com/artidoro/qlora) for parameter-efficient post-training, and used Mamba (https://github.com/state-spaces/mamba) as an alternative to transformer-based architectures. The Hugging Face Transformers library (https://github.com/huggingface/transformers) facilitated implementation, pretraining, and post-training of the open-source models. Training was performed on an HPC cluster, with jobs dispatched to nodes equipped with NVIDIA (Santa Clara, CA, USA) V100, A100, or H100 GPUs. Dataset available at: 10.5281/zenodo.13920001.
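
To illustrate why low-rank adaptation counts as parameter-efficient, the sketch below compares the number of trainable parameters in a full weight update with the number in a rank-r LoRA update (two factors, B and A, whose product is added to a frozen weight). The layer dimensions and rank are hypothetical values chosen for illustration; they are not taken from the LA4SR scripts.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> dict:
    """Compare trainable parameters: full fine-tuning vs. a LoRA update.

    LoRA keeps the d_out x d_in weight W frozen and trains two small
    factors, B (d_out x rank) and A (rank x d_in), so the effective
    weight becomes W + B @ A.
    """
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # entries of B plus entries of A
    return {"full": full, "lora": lora, "ratio": lora / full}


# Hypothetical layer size and rank, for illustration only.
counts = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(counts)
```

With these assumed values, the low-rank update trains well under 1% of the parameters of the full matrix, which is what makes post-training feasible on a single GPU node.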


Data S3 | Interpretability scripts, including DeepMotifMinerPro. Scripts implementing the custom explainer programs presented with this work, including Captum-, DeepLIFT-, and SHAP-based approaches, which explain how individual amino acid residues, their patterns, and their positions affect model decisions.
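
As a rough illustration of per-residue attribution, the sketch below uses occlusion: each position is masked in turn and the drop in a score is recorded. This is a deliberately simpler stand-in, not the Captum-, DeepLIFT-, or SHAP-based methods in Data S3, and the scoring function is a hypothetical toy rather than a trained model.

```python
def occlusion_attributions(sequence, score_fn, mask="X"):
    """Return the score drop at each position when that residue is masked."""
    baseline = score_fn(sequence)
    attributions = []
    for i in range(len(sequence)):
        # Replace only position i with the mask symbol.
        occluded = sequence[:i] + mask + sequence[i + 1:]
        attributions.append(baseline - score_fn(occluded))
    return attributions


def toy_score(sequence):
    """Hypothetical stand-in for a model: counts tryptophan (W) residues."""
    return float(sequence.count("W"))


attrs = occlusion_attributions("MKWLW", toy_score)
print(attrs)  # [0.0, 0.0, 1.0, 0.0, 1.0]
```

Positions whose masking lowers the score receive positive attribution; here only the two W residues do, because the toy score depends on nothing else.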


Data S4 | Additional validations using new assemblies from seen species. We cultured and sequenced ten separate isogenic colonies of Chlamydomonas reinhardtii CC-1883. Of these, nine were sequenced with Illumina 150 bp paired-end short reads and one with Pacific Biosciences (PacBio, Menlo Park, CA, USA) HiFi reads and Dovetail (Sydney, Australia) Hi-C to generate a complete, axenic reference assembly. These data were also uploaded to NCBI (SAMN44618602).

 

Data S5 | Singularity container to run LA4SR. Includes an environment supporting LA4SR.

Files

FIGURE_1-25MAY2025.pdf

Files (16.2 GB)

Checksum                              Size
md5:c6adc89e86e8d0da44e1be1a9cc9e3e5  5.9 GB
md5:50062123a7f64e1b945203a44e608f91  17.5 kB
md5:9d94f9308dfca8957d2c1077a50e9a58  29.5 kB
md5:904077cfec9c910904330b7cffb0a6df  435.1 MB
md5:731b802117bfad1286a40eb5d19626b6  9.8 GB
md5:8bed5f7faa910cf89dc5c6d5b75f4566  8.9 MB
md5:a8e191fc1683d43059c10ae09f20ffc5  1.6 MB
md5:c14c1ba3cbfaa6a35d52719186d8eaf7  15.3 MB
md5:fee144d075e7e03a3ee260af993e0aaa  2.6 MB
md5:3614d18369de8d834234a4306147918d  731.5 kB
md5:8ecb4867b03e63a6145b6c4d325d830f  684.0 kB
md5:aabc244df4fd588e640f424969de9256  7.5 MB
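
The MD5 values listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and any file path passed in is supplied by the user, since file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:c6ad...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, calling matches_listing on the downloaded 5.9 GB file with "md5:c6adc89e86e8d0da44e1be1a9cc9e3e5" should return True if the transfer was not corrupted.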

Additional details