circ-EnviroPredict: A machine learning-based tool to predict potential involvement of circRNAs with cold and drought stress through a Word2Vec approach - datasets

Martins Ferreira, Maria Clara; Schmitt Kremer, Frederico; Galli, Vanessa

doi:10.5281/zenodo.19372429

Published April 1, 2026 | Version v1

Dataset Open

1. Universidade Federal de Pelotas

Contributors

Researcher (3):

1. Universidade Federal de Pelotas
2. Universidade Federal de Pelotas

circ-EnviroPredict Zenodo Repository

Abstract

This repository contains the primary datasets used in the development and validation of circ-EnviroPredict, a machine learning tool designed to predict circular RNA (circRNA) involvement in plant abiotic stress conditions (cold and drought). The repository includes raw genomic sequences for vocabulary construction, labeled database records, processed Word2Vec numerical embeddings based on k-mer segmentation, approximate nearest neighbor search results for sequence similarity analysis, and independent cross-species datasets for external model validation.

Dataset Directory Structure

1. annoy/ (Approximate Nearest Neighbors Analysis)

neighbors_results_cold_rice.xlsx: Table of 5-nearest-neighbor search results used to evaluate sequence similarity between rice circRNAs under cold stress and under control conditions.
neighbors_results_drought.xlsx: Table of 5-nearest-neighbor search results used to evaluate sequence similarity between circRNAs under drought stress and under control conditions.
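
For readers unfamiliar with nearest-neighbor tables, the sketch below shows an exact (brute-force) 5-nearest-neighbor ranking by cosine similarity. This is a stand-in, not the approximate search used to produce the files in annoy/: an approximate index trades some accuracy for speed on the same kind of query. The function names, the toy vectors, and the choice of cosine similarity as the metric are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, candidates, k=5):
    """Return the k (id, similarity) pairs most similar to `query`.

    `candidates` maps a hypothetical sequence id to its embedding vector.
    """
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings (hypothetical; the real vectors are 64-dimensional).
control = {"ctrl_a": [1.0, 0.0], "ctrl_b": [0.0, 1.0], "ctrl_c": [1.0, 1.0]}
stress_vector = [0.9, 0.1]
top = nearest_neighbors(stress_vector, control, k=2)
```

Here `top` lists the two control sequences closest to the stress-condition vector, most similar first.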

2. raw/ (Primary Sequence Data and Metadata)

maize_db.xlsx & rice_db.xlsx: Labeled database records containing circRNA annotations, environmental condition classifications (control, cold, drought), and metadata extracted from CropCircDB.
osaj43883_genomic_seq.txt & zma10381_genomic_seq.txt: Genomic sequences in FASTA format for 43,883 rice (Oryza sativa) circRNAs and 10,381 maize (Zea mays) circRNAs, used as the text corpus for Word2Vec vocabulary construction.
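
Because the sequence files are FASTA-formatted text, they can be read with a few lines of standard-library Python. The following is a minimal sketch: the function name `parse_fasta` and the example headers are hypothetical, and the actual header layout inside the files is not documented here.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dictionary.

    Sequence lines belonging to one record are concatenated and upper-cased.
    Records with identical headers overwrite each other.
    """
    records = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)  # flush the last record
    return records


# Hypothetical two-record example; real headers may look different.
example = """>circ_example_1
ACGTAC
GTTA
>circ_example_2
ggcat
""".splitlines()
records = parse_fasta(example)
```

An open file handle can be passed in place of `example`, for instance one opened on raw/osaj43883_genomic_seq.txt after extracting the archive (the path is assumed from the directory listing above).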

3. sample_validation/ (Cross-Species Validation Sets)

validation_seq_arabidopsis.txt: Independent test set sequences for Arabidopsis thaliana.
validation_seq_soybean.txt: Independent test set sequences for Glycine max.
validation_seq_t_aestivum.txt: Independent test set sequences for Triticum aestivum.
validation_seq_maize.txt: Supplementary validation sequence set for maize.
validation_seq_control.txt: Unstressed baseline sequences utilized for external model validation.

4. word2vec_datasets/ (Engineered Feature Sets)

maize_w2vec_3mer_64_dataset.xlsx: Numerically encoded maize dataset used for machine learning training and testing. Sequences are transformed into features via Word2Vec utilizing 3-mer segmentation and a 64-dimensional vector space.
rice_w2vec_3mer_64_dataset.xlsx: Numerically encoded rice dataset used for machine learning training and testing. Sequences are transformed into features via Word2Vec utilizing 3-mer segmentation and a 64-dimensional vector space.
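
The 3-mer segmentation step mentioned above can be sketched as follows. Whether the original pipeline used overlapping or non-overlapping 3-mers is not stated, so the `stride` parameter (and the function name `kmer_tokens`) is an assumption that lets either variant be expressed.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a nucleotide sequence into k-mer "words".

    stride=1 yields overlapping k-mers; stride=k yields non-overlapping ones.
    Trailing bases that do not fill a complete k-mer are dropped.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


overlapping = kmer_tokens("ACGTAC")                # ['ACG', 'CGT', 'GTA', 'TAC']
non_overlapping = kmer_tokens("ACGTAC", stride=3)  # ['ACG', 'TAC']
```

Each token list plays the role of a sentence for a Word2Vec trainer; with gensim, for example, a 64-dimensional model would typically be requested through the `vector_size=64` argument of `Word2Vec`. How the per-token vectors were combined into one feature row per circRNA is not described in this record.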

5. Root Directory Files

file.txt: General repository documentation or unstructured textual data.

Files

circ-enviropredict_data.zip (248.0 MB)
md5: d58085e12736fe08b733d92e88dd283f
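
After downloading, the archive can be checked against the MD5 checksum listed for it. The sketch below uses only the Python standard library; the helper names `md5_of_file` and `verify_download` are hypothetical, and the file is read in chunks so the 248 MB archive is never loaded into memory at once.

```python
import hashlib

# Checksum as listed on this record for circ-enviropredict_data.zip.
EXPECTED_MD5 = "d58085e12736fe08b733d92e88dd283f"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("circ-enviropredict_data.zip")` in the download directory should return True for an intact copy; a False result suggests an incomplete or corrupted download.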
