Published April 1, 2026 | Version v1
Dataset Open

circ-EnviroPredict: A machine learning-based tool to predict potential involvement of circRNAs with cold and drought stress through a Word2Vec approach - datasets

  • 1. Universidade Federal de Pelotas
  • 2. Universidade Federal de Pelotas

Description

circ-EnviroPredict Zenodo Repository

Abstract

This repository contains the primary datasets used in the development and validation of circ-EnviroPredict, a machine learning tool designed to predict circular RNA (circRNA) involvement in plant abiotic stress conditions (cold and drought). The repository includes raw genomic sequences for vocabulary construction, labeled database records, processed Word2Vec numerical embeddings based on k-mer segmentation, approximate nearest neighbor search results for sequence similarity analysis, and independent cross-species datasets for external model validation.
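The k-mer segmentation step mentioned above can be illustrated with a short sketch. It assumes overlapping k-mers with a stride of 1, which this record does not state; the function name `kmer_tokens` and its `stride` parameter are hypothetical and are not taken from the circ-EnviroPredict code.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a nucleotide sequence into k-mer "words".

    Assumption: overlapping windows (stride=1). The stride used by
    circ-EnviroPredict is not given in this record.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.strip().upper()
    # Slide a window of width k; sequences shorter than k yield no tokens.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = kmer_tokens("ATGCGT")            # overlapping 3-mers
blocks = kmer_tokens("ATGCGT", stride=3)  # non-overlapping 3-mers
```

Each token list would then serve as one "sentence" for a Word2Vec model; the training settings themselves, other than 3-mers and 64 dimensions, are not described here.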

Dataset Directory Structure

1. annoy/ (Approximate Nearest Neighbors Analysis)
  • neighbors_results_cold_rice.xlsx: Tabular results of a 5-nearest-neighbor search evaluating sequence similarity between rice circRNAs under cold stress and control conditions.
  • neighbors_results_drought.xlsx: Tabular results of a 5-nearest-neighbor search evaluating sequence similarity between circRNAs under drought stress and control conditions.
2. raw/ (Primary Sequence Data and Metadata)
  • maize_db.xlsx & rice_db.xlsx: Labeled database records containing circRNA annotations, environmental condition classifications (control, cold, drought), and metadata extracted from CropCircDB.
  • osaj43883_genomic_seq.txt & zma10381_genomic_seq.txt: Genomic sequences in FASTA format for 43,883 rice (Oryza sativa) circRNAs and 10,381 maize (Zea mays) circRNAs, used as the text corpus for Word2Vec vocabulary construction.
3. sample_validation/ (Cross-Species Validation Sets)
  • validation_seq_arabidopsis.txt: Independent test set sequences for Arabidopsis thaliana.
  • validation_seq_soybean.txt: Independent test set sequences for Glycine max.
  • validation_seq_t_aestivum.txt: Independent test set sequences for Triticum aestivum.
  • validation_seq_maize.txt: Supplementary validation sequence set for maize.
  • validation_seq_control.txt: Unstressed baseline sequences utilized for external model validation.
4. word2vec_datasets/ (Engineered Feature Sets)
  • maize_w2vec_3mer_64_dataset.xlsx: Numerically encoded maize dataset used for machine learning training and testing. Sequences are transformed into features via Word2Vec utilizing 3-mer segmentation and a 64-dimensional vector space.
  • rice_w2vec_3mer_64_dataset.xlsx: Numerically encoded rice dataset used for machine learning training and testing. Sequences are transformed into features via Word2Vec utilizing 3-mer segmentation and a 64-dimensional vector space.
5. Root Directory Files
  • file.txt: General repository documentation or unstructured textual data.
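The sequence files in raw/ are described above as FASTA-formatted text. A minimal sketch for reading such a file with the standard library follows. It assumes conventional FASTA records (a `>` header line followed by one or more sequence lines); the exact header layout of these files is not documented here, and `read_fasta` is a hypothetical helper, not part of the tool.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Illustrative in-memory input; the identifiers are made up.
demo = io.StringIO(">circ_demo_1\nATGC\nGTTA\n>circ_demo_2\nCCGG\n")
records = list(read_fasta(demo))
```

To read a real file, pass an open text handle instead of the in-memory demo, for example the handle returned by `open("raw/osaj43883_genomic_seq.txt")`, assuming the archive was extracted into the working directory.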

Files

circ-enviropredict_data.zip (248.0 MB)
md5:d58085e12736fe08b733d92e88dd283f
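The MD5 checksum listed for the archive can be used to check a download for corruption. The sketch below uses only the Python standard library; the helper names and the assumption that the archive was saved under its original file name are illustrative.

```python
import hashlib

# Checksum as listed on this record for circ-enviropredict_data.zip.
EXPECTED_MD5 = "d58085e12736fe08b733d92e88dd283f"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("circ-enviropredict_data.zip")` after downloading should return `True` for an intact copy. Reading in chunks keeps memory use low for the 248.0 MB archive.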