There is a newer version of the record available.

Published March 3, 2026 | Version 1.0.0
Dataset Open

Dataset: Interpretable multimodal learning from sequence and genomic context for lncRNA classification

Description

This dataset contains all data files and experimental results associated with the manuscript "Interpretable multimodal learning from sequence and
genomic context for lncRNA classification".

Related Code Repository: https://github.com/cbib/beta_vae_lnclassifier

Code Archive DOI: 10.5281/zenodo.18833347

Manuscript: [Citation when available]

File Organization

The dataset is organized into three ZIP archives:

  1. data.zip - Input data files and preprocessing outputs
  2. gencode_v47_experiments.zip - All experimental results on GENCODE v47
  3. gencode_v49_experiments.zip - All experimental results on GENCODE v49

1. data.zip

Contains all input sequences, features, and lncRNA-BERT baseline results.

Contents:

cdhit_clusters/

CD-HIT clustered transcript sequences for training:

  • v47_lncRNA_clustered.fa - CD-HIT clustered lncRNA sequences (v47)
  • v47_pc_clustered.fa - CD-HIT clustered protein-coding sequences (v47)
  • v49_lncRNA_clustered.fa - CD-HIT clustered lncRNA sequences (v49)
  • v49_pc_clustered.fa - CD-HIT clustered protein-coding sequences (v49)
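
As a quick sanity check after extracting the archive, the clustered FASTA files can be summarized with the standard library alone. The sketch below is illustrative and not part of the released code. It assumes plain (uncompressed) FASTA with ">" header lines, and the helper names `read_fasta` and `summarize_fasta` are hypothetical.

```python
from statistics import mean


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(lines):
    """Return the record count plus min, mean, and max sequence length."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    if not lengths:
        return {"n": 0, "min": 0, "mean": 0.0, "max": 0}
    return {"n": len(lengths), "min": min(lengths),
            "mean": mean(lengths), "max": max(lengths)}


# Usage (the path depends on where data.zip was extracted):
# with open("cdhit_clusters/v47_lncRNA_clustered.fa") as handle:
#     print(summarize_fasta(handle))
```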

dataset_biotypes/

Biotype annotations for datasets:

  • v47_dataset_biotypes_cdhit.csv - Transcript biotype labels (v47)
  • v49_dataset_biotypes_cdhit.csv - Transcript biotype labels (v49)

Format: CSV with columns including transcript_id, biotype, gene_id
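
A minimal sketch of reading these annotation files, assuming only the `biotype` column named above and a standard comma-separated layout with a header row. The function names are hypothetical, and the actual label values in the files are not specified here.

```python
import csv
from collections import Counter


def count_biotypes(rows, column="biotype"):
    """Count how many transcripts carry each biotype label."""
    return Counter(row[column] for row in rows)


def load_biotype_counts(path):
    """Read a biotype CSV (header row expected) and count labels."""
    with open(path, newline="") as handle:
        return count_biotypes(csv.DictReader(handle))


# Usage (the path depends on where data.zip was extracted):
# print(load_biotype_counts("dataset_biotypes/v47_dataset_biotypes_cdhit.csv"))
```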

lncRNABERT_results/

Zero-shot baseline results from lncRNA-BERT:

  • v47_lncRNABERT_embeddings.h5 - Learned embeddings (v47)
  • v47_lncRNABERT_results.csv - Predictions and metrics (v47)
  • v49_lncRNABERT_embeddings.h5 - Learned embeddings (v49)
  • v49_lncRNABERT_results.csv - Predictions and metrics (v49)

processed_features/

Cleaned and normalized non-B DNA and transposable element (TE) feature vectors with associated metadata:

  • v47_nonb_feature_names.txt - Non-B DNA feature names (v47)
  • v47_nonb_features_clean.csv - Processed non-B DNA features (v47)
  • v47_nonb_scaler.pkl - Scikit-learn scaler for non-B features (v47)
  • v47_te_feature_names.txt - TE feature names (v47)
  • v47_te_features_clean.csv - Processed TE features (v47)
  • v47_te_scaler.pkl - Scikit-learn scaler for TE features (v47)
  • v49_nonb_feature_names.txt - Non-B DNA feature names (v49)
  • v49_nonb_features_clean.csv - Processed non-B DNA features (v49)
  • v49_nonb_scaler.pkl - Scikit-learn scaler for non-B features (v49)
  • v49_te_feature_names.txt - TE feature names (v49)
  • v49_te_features_clean.csv - Processed TE features (v49)
  • v49_te_scaler.pkl - Scikit-learn scaler for TE features (v49)

Note: The feature scalers (.pkl) are serialized scikit-learn objects. Load them in an environment where scikit-learn is installed to apply the same normalization used during training.
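
The following sketch shows one way the scalers might be loaded and applied. It assumes the .pkl files were written with Python's `pickle` module; if they were saved with joblib instead, `joblib.load` would be the counterpart. It also assumes the unpickled object exposes the usual scikit-learn `transform()` method. The function names are hypothetical. Pickle files can execute arbitrary code when loaded, so only unpickle files from a source you trust.

```python
import pickle


def load_scaler(path):
    """Unpickle a fitted scaler. scikit-learn must be importable for this to work."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def scale_rows(scaler, rows):
    """Apply a fitted scaler's transform() to a 2D array-like of feature rows."""
    return scaler.transform(rows)


# Usage (paths depend on where data.zip was extracted; the column order of
# `rows` presumably has to match the corresponding *_feature_names.txt file):
# scaler = load_scaler("processed_features/v47_te_scaler.pkl")
# scaled = scale_rows(scaler, rows)
```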

split_gencode_47/

Train/test split for GENCODE v47:

  • lnc_test.fa - lncRNA test set
  • lnc_trainval.fa - lncRNA training+validation set
  • pc_test.fa - Protein-coding test set
  • pc_trainval.fa - Protein-coding training+validation set
  • split_manifest.json - Split metadata and statistics
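
One simple check on a split like this is that no record ID appears in both the training+validation and test files. The sketch below is illustrative and assumes the record ID is the first whitespace-delimited token of each FASTA header, which may not match the header layout used in these files. The function names are hypothetical. Disjoint IDs do not by themselves rule out sequence similarity between the splits.

```python
def fasta_ids(lines):
    """Collect record IDs: the first whitespace-delimited token of each header."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def split_overlap(trainval_lines, test_lines):
    """Return IDs present in both splits; an empty set means no overlap by ID."""
    return fasta_ids(trainval_lines) & fasta_ids(test_lines)


# Usage (paths depend on where data.zip was extracted):
# with open("split_gencode_47/lnc_trainval.fa") as tv, \
#         open("split_gencode_47/lnc_test.fa") as te:
#     print(len(split_overlap(tv, te)))
```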

split_gencode_49/

Train/test split for GENCODE v49 (same structure as split_gencode_47/)

2. gencode_v47_experiments.zip

Experimental results for all models trained and evaluated on GENCODE v47.

Contents:

beta_vae_contrastive_g47/

β-VAE with contrastive learning (sequence-only baseline):

  • evaluation_csvs/ - Evaluation metrics and predictions
  • global_biotype_enrichment/ - Biotype enrichment analysis
  • models/ - Model checkpoints
  • performance_figures/ - Performance visualization plots
  • spatial_analysis/ - Spatial clustering analysis
  • umap_visualizations/ - UMAP embedding visualizations
  • ANALYSIS_SUMMARY.md - Summary of key findings
  • biotype_mapping.json - Biotype label mappings
  • cv_evaluation_results.json - Cross-validation results
  • cv_fold_results.csv - Per-fold cross-validation metrics
  • embeddings_all_folds.npz - Concatenated embeddings from all CV folds
  • embeddings_best_fold.npz - Embeddings from best performing fold
  • model_architecture.txt - Model architecture description
  • model_paths.csv - Paths to saved model files
  • test_results.json - Final test set results
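
Per-fold metrics such as those in cv_fold_results.csv are commonly reported as a mean and standard deviation across folds. The sketch below assumes a CSV with a header row and one row per fold. The metric column name is passed in by the caller because the actual column names are not listed here, and the function names are hypothetical.

```python
import csv
from statistics import mean, stdev


def summarize_metric(rows, metric):
    """Return (mean, sample standard deviation) of one metric column across folds."""
    values = [float(row[metric]) for row in rows]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


def summarize_fold_csv(path, metric):
    """Read a per-fold CSV and summarize the given metric column."""
    with open(path, newline="") as handle:
        return summarize_metric(csv.DictReader(handle), metric)


# Usage ("f1" is a placeholder column name, not taken from the dataset):
# print(summarize_fold_csv("beta_vae_contrastive_g47/cv_fold_results.csv", "f1"))
```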

beta_vae_features_attn_g47/

β-VAE with attention-based feature fusion (TE + non-B DNA): Same structure as beta_vae_contrastive_g47/

beta_vae_features_g47/

β-VAE with concatenated features (TE + non-B DNA): Same structure as beta_vae_contrastive_g47/

cnn_g47/

CNN baseline (sequence-only): Same structure as beta_vae_contrastive_g47/

stat_results/

Statistical analysis results across all models:

ablations_v47/

Ablation study results:

  • bootstrap_f1_ci.csv - Bootstrap confidence intervals for F1 scores
  • delongauc_ci.csv - DeLong test for AUC comparisons
  • fold_summary.csv - Summary statistics per fold
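
For readers unfamiliar with bootstrap confidence intervals, the sketch below shows a generic percentile bootstrap for a binary F1 score. It is not necessarily the procedure used to produce bootstrap_f1_ci.csv: the number of resamples, the interval type, the seed, and the 0/1 label encoding are all assumptions made for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for labels in {0, 1}; returns 0.0 when F1 is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample examples with replacement, recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled scores.
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Calling `bootstrap_f1_ci(y_true, y_pred)` returns a `(lower, upper)` pair for a 95% interval under these assumptions.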

g47/

GENCODE v47 statistical analysis:

  • g47_bootstrap_f1_ci.csv - Bootstrap F1 confidence intervals
  • g47_fold_summary.csv - Per-fold summary statistics
  • hardcase_jaccard_pairwise_v47.csv - Jaccard similarity for hard cases
  • hardcase_jaccard_v47.csv - Hard case Jaccard indices
  • hardcase_membership_long_v47.csv - Hard case membership matrix
  • hardcase_upset_v47.png - UpSet plot for hard case overlaps
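
The Jaccard index between two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below computes it for every pair of models, assuming each model's hard cases are available as a set of transcript IDs. How hard cases were defined for this dataset is not stated here, and the model keys and function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(hard_cases):
    """Map each unordered model pair to the Jaccard index of their hard-case sets."""
    return {(m1, m2): jaccard(hard_cases[m1], hard_cases[m2])
            for m1, m2 in combinations(sorted(hard_cases), 2)}


# Usage with placeholder model names and transcript IDs:
# pairwise_jaccard({"cnn": {"t1", "t2"}, "beta_vae": {"t2", "t3"}})
```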

3. gencode_v49_experiments.zip

Experimental results for all models trained and evaluated on GENCODE v49.

Contents:

Same directory structure as gencode_v47_experiments.zip:

  • beta_vae_contrastive_g49/
  • beta_vae_features_attn_g49/
  • beta_vae_features_g49/
  • cnn_g49/
  • stat_results/ablations_v49/ and stat_results/g49/

Reproducibility

To reproduce the results, refer to the code repository (DOI: 10.5281/zenodo.18833347) for the scripts.

The split_manifest.json files document the exact train/test splits used.

Citation

If you use this dataset, please cite:

[Author list]. (2026). [Manuscript title]. 
Bioinformatics. DOI: [DOI when available]

Dataset DOI: 10.5281/zenodo.18849718
Code DOI: 10.5281/zenodo.18833347

License

 CC BY 4.0

Contact

For questions or issues regarding this dataset, please contact:

  • Mikaël Georges: mikael.georges@ibgc.cnrs.fr
  • Macha Nikolski: macha.nikolski@u-bordeaux.fr
  • Or open an issue on the GitHub repository: https://github.com/cbib/beta_vae_lnclassifier

Last Updated: March 3, 2026
Version: 1.0.0

Files

Files (8.3 GB)

  • 1.2 GB - md5:550f78df9c996941bc3ac83380bbf3da
  • 3.6 GB - md5:42cc14ca7423c7ac38e5868b552d4b07
  • 3.4 GB - md5:c10209e9f0ac9d1dd89211766a816763
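
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below uses Python's `hashlib`. The function names are hypothetical, and because the listing does not say which checksum belongs to which archive, the expected value is passed in by the caller.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 to an expected digest, with or without an 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (pair each archive with its checksum from the record page):
# verify_md5("data.zip", "md5:<checksum from the file list>")
```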

Additional details

Related works

Is supplement to: Software, DOI 10.5281/zenodo.18833347

Software

Repository URL: https://github.com/cbib/beta_vae_lnclassifier
Programming language: Python