circ-EnviroPredict: A machine learning-based tool to predict potential involvement of circRNAs with cold and drought stress through a Word2Vec approach - datasets
Description
circ-EnviroPredict Zenodo Repository
Abstract
This repository contains the primary datasets used in the development and validation of circ-EnviroPredict, a machine learning tool designed to predict circular RNA (circRNA) involvement in plant abiotic stress conditions (cold and drought). The repository includes raw genomic sequences for vocabulary construction, labeled database records, processed Word2Vec numerical embeddings based on k-mer segmentation, approximate nearest neighbor search results for sequence similarity analysis, and independent cross-species datasets for external model validation.
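As a rough illustration of the pipeline described above, the sketch below segments a sequence into k-mers, pools per-k-mer vectors into one fixed-length embedding, and ranks candidates by cosine similarity. Several details are assumptions rather than the authors' documented method: the overlapping stride of 1, mean pooling, and an exact brute-force search standing in for the approximate nearest neighbor search. The function names and the lookup table `table` are hypothetical; in practice the vectors would come from a trained Word2Vec model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def kmers(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a sequence into k-mers (overlapping when stride < k; stride is an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def embed(seq: str, table: Dict[str, Sequence[float]], dim: int, k: int = 3) -> List[float]:
    """Mean of the vectors of all k-mers present in `table` (mean pooling is an assumption)."""
    vecs = [table[t] for t in kmers(seq, k) if t in table]
    if not vecs:
        return [0.0] * dim  # no known k-mer: fall back to a zero vector
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest(query: Sequence[float],
            candidates: Dict[str, Sequence[float]],
            n: int = 5) -> List[Tuple[str, float]]:
    """Exact top-n neighbors by cosine similarity (stand-in for an approximate index)."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]
```

With `k=3` and `dim=64`, `embed` would produce vectors of the same shape as the encoded datasets described in this record, though the actual values depend on the trained model.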
Dataset Directory Structure
- neighbors_results_cold_rice.xlsx: Tabular results of the 5-nearest-neighbor search evaluating sequence similarity between rice circRNAs under cold stress and control conditions.
- neighbors_results_drought.xlsx: Tabular results of the 5-nearest-neighbor search evaluating sequence similarity between circRNAs under drought stress and control conditions.
- maize_db.xlsx & rice_db.xlsx: Labeled database records containing circRNA annotations, environmental condition classifications (control, cold, drought), and metadata extracted from CropCircDB.
- osaj43883_genomic_seq.txt & zma10381_genomic_seq.txt: Genomic sequences in FASTA format for 43,883 rice (Oryza sativa) circRNAs and 10,381 maize (Zea mays) circRNAs, used as the text corpus for Word2Vec vocabulary construction.
- validation_seq_arabidopsis.txt: Independent test set sequences for Arabidopsis thaliana.
- validation_seq_soybean.txt: Independent test set sequences for Glycine max.
- validation_seq_t_aestivum.txt: Independent test set sequences for Triticum aestivum.
- validation_seq_maize.txt: Supplementary validation sequence set for maize.
- validation_seq_control.txt: Unstressed baseline sequences used for external model validation.
- maize_w2vec_3mer_64_dataset.xlsx: Numerically encoded maize dataset used for machine learning training and testing. Sequences are converted to features with Word2Vec, using 3-mer segmentation and a 64-dimensional vector space.
- rice_w2vec_3mer_64_dataset.xlsx: Numerically encoded rice dataset used for machine learning training and testing. Sequences are converted to features with Word2Vec, using 3-mer segmentation and a 64-dimensional vector space.
- file.txt: General repository documentation or unstructured textual data.
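The genomic sequence files listed above are described as FASTA, so a small reader along the following lines may be enough to load them. This is a minimal sketch that assumes the conventional layout (a `>` header line followed by one or more sequence lines) and keeps only the first whitespace-delimited token of each header as the record ID; the function name `read_fasta` is hypothetical, and the header conventions of the actual files have not been checked here.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)  # flush the previous record
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)  # flush the last record
    return records


# Usage (the path inside the extracted archive is an assumption):
# with open("osaj43883_genomic_seq.txt") as handle:
#     rice = read_fasta(handle)
```

Because the function accepts any iterable of lines, it works on an open file handle as well as on an in-memory list of strings.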
Files

| Name | Size | MD5 |
|---|---|---|
| circ-enviropredict_data.zip | 248.0 MB | d58085e12736fe08b733d92e88dd283f |
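
The MD5 checksum listed for the archive can be used to confirm that a download is intact. The sketch below uses only Python's standard `hashlib` module and reads the file in chunks so the 248.0 MB archive is never held in memory all at once. The function name `file_md5` and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum as listed in this record for circ-enviropredict_data.zip
EXPECTED_MD5 = "d58085e12736fe08b733d92e88dd283f"


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks of `chunk_size` bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the archive was saved in the current directory):
# if file_md5("circ-enviropredict_data.zip") != EXPECTED_MD5:
#     raise ValueError("checksum mismatch: the download may be corrupted")
```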