Dataset: Interpretable multimodal learning from sequence and genomic context for lncRNA classification
Description
This dataset contains all data files and experimental results associated with the manuscript "Interpretable multimodal learning from sequence and
genomic context for lncRNA classification".
Related Code Repository: https://github.com/cbib/beta_vae_lnclassifier
Code Archive DOI: 10.5281/zenodo.18833347
Manuscript: [Citation when available]
File Organization
The dataset is organized into three ZIP archives:
- data.zip: Input data files and preprocessing outputs
- gencode_v47_experiments.zip: All experimental results on GENCODE v47
- gencode_v49_experiments.zip: All experimental results on GENCODE v49
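Before extracting an archive, it can help to check that its top-level layout matches the description in this record. The sketch below is only an illustration: it assumes the archives have been downloaded locally, and `top_level_entries` is a hypothetical helper name, not part of the code repository.

```python
import zipfile
from pathlib import PurePosixPath


def top_level_entries(zip_path):
    """Return the sorted top-level names (files or directories) in a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    # ZIP member names always use forward slashes, so PurePosixPath is safe here.
    return sorted({PurePosixPath(name).parts[0] for name in names if name})


# Example usage, assuming data.zip sits in the current directory:
# print(top_level_entries("data.zip"))
```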
1. data.zip
Contains all input sequences, features, and lncRNA-BERT baseline results.
Contents:
cdhit_clusters/
CD-HIT clustered transcript sequences for training:
- v47_lncRNA_clustered.fa: CD-HIT clustered lncRNA sequences (v47)
- v47_pc_clustered.fa: CD-HIT clustered protein-coding sequences (v47)
- v49_lncRNA_clustered.fa: CD-HIT clustered lncRNA sequences (v49)
- v49_pc_clustered.fa: CD-HIT clustered protein-coding sequences (v49)
dataset_biotypes/
Biotype annotations for datasets:
- v47_dataset_biotypes_cdhit.csv: Transcript biotype labels (v47)
- v49_dataset_biotypes_cdhit.csv: Transcript biotype labels (v49)
Format: CSV with columns including transcript_id, biotype, gene_id
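As a quick sanity check on the biotype files, the transcripts per biotype can be counted with the standard library alone. This is a minimal sketch: only the `biotype` column name comes from the format note above, the rows in the in-memory example are made up, and `count_biotypes` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_biotypes(handle, column="biotype"):
    """Count rows per biotype from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for v47_dataset_biotypes_cdhit.csv (values are invented).
example = io.StringIO(
    "transcript_id,biotype,gene_id\n"
    "T1,lncRNA,G1\n"
    "T2,protein_coding,G2\n"
    "T3,lncRNA,G3\n"
)
counts = count_biotypes(example)
print(counts.most_common())  # prints [('lncRNA', 2), ('protein_coding', 1)]
```

For the real files, pass a handle opened with `open(path, newline="")`. Any additional columns in the CSV are ignored by this function.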
lncRNABERT_results/
Zero-shot baseline results from lncRNA-BERT:
- v47_lncRNABERT_embeddings.h5: Learned embeddings (v47)
- v47_lncRNABERT_results.csv: Predictions and metrics (v47)
- v49_lncRNABERT_embeddings.h5: Learned embeddings (v49)
- v49_lncRNABERT_results.csv: Predictions and metrics (v49)
processed_features/
Cleaned and normalized feature vectors with associated metadata:
- v47_nonb_feature_names.txt: Non-B DNA feature names (v47)
- v47_nonb_features_clean.csv: Processed non-B DNA features (v47)
- v47_nonb_scaler.pkl: Scikit-learn scaler for non-B features (v47)
- v47_te_feature_names.txt: TE feature names (v47)
- v47_te_features_clean.csv: Processed TE features (v47)
- v47_te_scaler.pkl: Scikit-learn scaler for TE features (v47)
- v49_nonb_feature_names.txt: Non-B DNA feature names (v49)
- v49_nonb_features_clean.csv: Processed non-B DNA features (v49)
- v49_nonb_scaler.pkl: Scikit-learn scaler for non-B features (v49)
- v49_te_feature_names.txt: TE feature names (v49)
- v49_te_features_clean.csv: Processed TE features (v49)
- v49_te_scaler.pkl: Scikit-learn scaler for TE features (v49)
Note: The feature scalers (.pkl) are serialized scikit-learn scaler objects. Load them with scikit-learn installed to apply the same normalization used during training.
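A scaler could be applied to new, unnormalized feature rows roughly as follows. This is a sketch under several assumptions: that the .pkl files can be read with the standard `pickle` module (if they were written with joblib, `joblib.load` may be needed instead), that each feature-names file lists one name per line in the column order the scaler expects, and that the function names are hypothetical. Since the `*_features_clean.csv` files are described as already normalized, the scaler would presumably be applied to raw features only.

```python
import pickle


def load_scaler(path):
    """Unpickle a fitted scaler. scikit-learn must be importable at load time.

    Only unpickle files from sources you trust: unpickling can execute code.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def read_feature_names(path):
    """Read one feature name per line, skipping blank lines (assumed layout)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def scale_rows(scaler, rows, feature_names):
    """Order each row's values by feature_names, then call scaler.transform."""
    matrix = [[float(row[name]) for name in feature_names] for row in rows]
    return scaler.transform(matrix)


# Example usage (paths assume data.zip was extracted into the current directory):
# scaler = load_scaler("processed_features/v47_te_scaler.pkl")
# names = read_feature_names("processed_features/v47_te_feature_names.txt")
# scaled = scale_rows(scaler, raw_rows, names)  # raw_rows: list of dicts
```

Loading a scaler with a scikit-learn version different from the one used to create it may trigger a warning or fail, so matching the environment in the code repository is the safer route.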
split_gencode_47/
Train/test split for GENCODE v47:
- lnc_test.fa: lncRNA test set
- lnc_trainval.fa: lncRNA training+validation set
- pc_test.fa: Protein-coding test set
- pc_trainval.fa: Protein-coding training+validation set
- split_manifest.json: Split metadata and statistics
split_gencode_49/
Train/test split for GENCODE v49 (same structure as split_gencode_47/)
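One simple check on the split files is that no record ID appears in both the training+validation set and the test set. The sketch below assumes the record ID is the header text up to the first `|` or whitespace, which may not match the actual header layout; `fasta_ids` and `shared_ids` are hypothetical helpers, and the sequences in the example are invented. It compares IDs only and says nothing about sequence similarity.

```python
import io


def fasta_ids(handle):
    """Collect record IDs from an open FASTA handle.

    The ID is taken as the header text up to the first '|' or whitespace
    (an assumption about the header format).
    """
    ids = set()
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split("|")[0].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def shared_ids(trainval_handle, test_handle):
    """Return IDs present in both files; an empty set means no ID overlap."""
    return fasta_ids(trainval_handle) & fasta_ids(test_handle)


# In-memory stand-ins for lnc_trainval.fa and lnc_test.fa (invented records).
trainval = io.StringIO(">T1|G1\nACGT\n>T2|G2\nGGCC\n")
test = io.StringIO(">T3|G3\nTTAA\n")
print(shared_ids(trainval, test))  # prints set()
```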
2. gencode_v47_experiments.zip
Experimental results for all models trained and evaluated on GENCODE v47.
Contents:
beta_vae_contrastive_g47/
β-VAE with contrastive learning (sequence-only baseline):
- evaluation_csvs/: Evaluation metrics and predictions
- global_biotype_enrichment/: Biotype enrichment analysis
- models/: Model checkpoints
- performance_figures/: Performance visualization plots
- spatial_analysis/: Spatial clustering analysis
- umap_visualizations/: UMAP embedding visualizations
- ANALYSIS_SUMMARY.md: Summary of key findings
- biotype_mapping.json: Biotype label mappings
- cv_evaluation_results.json: Cross-validation results
- cv_fold_results.csv: Per-fold cross-validation metrics
- embeddings_all_folds.npz: Concatenated embeddings from all CV folds
- embeddings_best_fold.npz: Embeddings from best performing fold
- model_architecture.txt: Model architecture description
- model_paths.csv: Paths to saved model files
- test_results.json: Final test set results
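Per-fold metrics such as those in cv_fold_results.csv are commonly summarized as a mean and standard deviation across folds. The sketch below shows one way to do that with the standard library. The column names `fold` and `f1` and the values are placeholders, since the actual column layout of the file is not documented here, and `summarize_metric` is a hypothetical helper.

```python
import csv
import io
import statistics


def summarize_metric(handle, metric):
    """Return (mean, sample standard deviation) of one metric column across folds."""
    values = [float(row[metric]) for row in csv.DictReader(handle)]
    spread = statistics.stdev(values) if len(values) > 1 else 0.0
    return statistics.mean(values), spread


# In-memory stand-in for a per-fold results file (column names and values are invented).
example = io.StringIO("fold,f1\n0,0.90\n1,0.92\n2,0.94\n")
mean_f1, sd_f1 = summarize_metric(example, "f1")
print(f"F1 = {mean_f1:.3f} +/- {sd_f1:.3f}")
```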
beta_vae_features_attn_g47/
β-VAE with attention-based feature fusion (TE + non-B DNA): Same structure as beta_vae_contrastive_g47/
beta_vae_features_g47/
β-VAE with concatenated features (TE + non-B DNA): Same structure as beta_vae_contrastive_g47/
cnn_g47/
CNN baseline (sequence-only): Same structure as beta_vae_contrastive_g47/
stat_results/
Statistical analysis results across all models:
ablations_v47/
Ablation study results:
- bootstrap_f1_ci.csv: Bootstrap confidence intervals for F1 scores
- delongauc_ci.csv: DeLong test for AUC comparisons
- fold_summary.csv: Summary statistics per fold
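For readers unfamiliar with bootstrap confidence intervals for F1, the general idea can be sketched as follows. This is a generic percentile bootstrap that resamples examples with replacement. It is not necessarily the procedure used to produce bootstrap_f1_ci.csv: the number of resamples, the choice of positive class, and the percentile method are all assumptions made for illustration.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no positives at all."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Invented labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
print(f1_score(y_true, y_pred), bootstrap_f1_ci(y_true, y_pred))
```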
g47/
GENCODE v47 statistical analysis:
- g47_bootstrap_f1_ci.csv: Bootstrap F1 confidence intervals
- g47_fold_summary.csv: Per-fold summary statistics
- hardcase_jaccard_pairwise_v47.csv: Jaccard similarity for hard cases
- hardcase_jaccard_v47.csv: Hard case Jaccard indices
- hardcase_membership_long_v47.csv: Hard case membership matrix
- hardcase_upset_v47.png: UpSet plot for hard case overlaps
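The Jaccard index between two sets of hard cases is the size of their intersection divided by the size of their union. The sketch below computes it for every pair of models. The transcript IDs and set contents are invented, the criterion that defines a hard case is not specified in this record, and the layout of the hardcase CSV files may differ from the in-memory structure used here.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(sets_by_model):
    """Jaccard index for every unordered pair of models, keyed by (name1, name2)."""
    return {
        (m1, m2): jaccard(sets_by_model[m1], sets_by_model[m2])
        for m1, m2 in combinations(sorted(sets_by_model), 2)
    }


# Hypothetical hard-case sets per model (IDs are placeholders).
hard_cases = {
    "cnn": {"T1", "T2", "T3"},
    "beta_vae_contrastive": {"T2", "T3", "T4"},
    "beta_vae_features": {"T3", "T5"},
}
scores = pairwise_jaccard(hard_cases)
for pair, score in scores.items():
    print(pair, round(score, 3))
```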
3. gencode_v49_experiments.zip
Experimental results for all models trained and evaluated on GENCODE v49.
Contents:
Same directory structure as gencode_v47_experiments.zip:
- beta_vae_contrastive_g49/
- beta_vae_features_attn_g49/
- beta_vae_features_g49/
- cnn_g49/
- stat_results/ablations_v49/ and stat_results/g49/
Reproducibility
To reproduce the results, refer to the code repository (DOI: 10.5281/zenodo.18833347) for scripts.
The split_manifest.json files document the exact train/test splits used.
Citation
If you use this dataset, please cite:
[Author list]. (2026). [Manuscript title].
Bioinformatics. DOI: [DOI when available]
Dataset DOI: 10.5281/zenodo.18849718
Code DOI: 10.5281/zenodo.18833347
License
CC BY 4.0
Contact
For questions or issues regarding this dataset, please contact:
- Mikaël Georges: mikael.georges@ibgc.cnrs.fr
- Macha Nikolski: macha.nikolski@u-bordeaux.fr
- Or open an issue on the GitHub repository: https://github.com/cbib/beta_vae_lnclassifier
Last Updated: 2026-03-03
Version: 1.0.0
Additional details
Related works
- Is supplement to: Software, DOI 10.5281/zenodo.18833347
Software
- Repository URL: https://github.com/cbib/beta_vae_lnclassifier
- Programming language: Python