Published June 18, 2024 | Version v7
Software Open

YangL-Coder/mRCat: mRCat

Description

mRCat

Title: mRCat: a Novel CatBoost Predictor for Binary Classification of mRNA Subcellular Localization by Fusing Large Language Model Representation and Sequence Features

The subcellular localization of messenger RNAs (mRNAs) is tightly linked to gene regulation and protein synthesis, and it offers insights into disease diagnosis and drug development in biomedicine. We propose a new machine learning-based predictor of mRNA subcellular localization, named mRCat. The predictor first uses large language models to extract implicit information from the sequence. It then combines these representations with traditional sequence features for a comprehensive description of mRNA sequences. Finally, it uses CatBoost as the base classifier to predict the subcellular localization of mRNAs. On an independent test set, mRCat obtains an accuracy of 0.761, an F1 score of 0.710, an MCC of 0.511, and an AUROC of 0.751. These results indicate that the method has higher accuracy and robustness than other state-of-the-art methods. It is anticipated to offer deeper insights into biomolecular research.
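As a reading aid for the reported scores, the following minimal sketch computes accuracy, F1 score, and MCC for a binary classifier from their standard confusion-matrix definitions. It is an illustration only: the function name `binary_metrics` and the toy labels are assumptions, and the repository's notebooks may well compute these values differently (for example with scikit-learn). AUROC is left out because it requires predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, F1 and MCC for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)

    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC is defined as 0 when any marginal count is zero.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy labels, invented for illustration (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

MCC uses all four confusion-matrix cells, so on imbalanced data it can be much lower than accuracy.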

mRCat uses the following dependencies:

python: 3.9

pandas: 1.4.4

sklearn: 1.0.2

numpy: 1.21.5

matplotlib: 3.4.3
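A quick way to check an environment against these pins is sketched below using only the standard library. The helper name `find_mismatches` is hypothetical and not part of the repository. The package listed as sklearn is distributed on PyPI as scikit-learn, so that name is used for the lookup. The Python version itself is not checked here.

```python
from importlib import metadata

# Pins copied from the dependency list. "sklearn" is published on PyPI as
# "scikit-learn", so that distribution name is used for the lookup.
PINNED = {
    "pandas": "1.4.4",
    "scikit-learn": "1.0.2",
    "numpy": "1.21.5",
    "matplotlib": "3.4.3",
}


def find_mismatches(pins):
    """Return {package: (expected, installed_or_None)} for every pin not met."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

A newer version may still work; a mismatch reported here only means the environment differs from the one listed above.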

Guiding principles:

1. The folder "data" contains the datasets used in this study, including the training and test sets. The mRNA_sublocation_TrainingSet_NC-BERTdata.csv file in the TrainingSet folder holds the sequence features of the training set. The mRNA_sublocation_TestSet_NC-BERTdata.csv file in the TestSet folder holds the sequence features of the test set. The remaining CSV files are intermediate data files produced while constructing the training set and the test set.

2. The folder "result" contains the results produced by the experiments in this study.

3. Among the IPYNB files, CV_violin_plots and DocProcess contain the code for data preprocessing and visualization; mRNA_sublocation_TestSet_EIIP, mRNA_sublocation_TestSet-DNABERT, mRNA_sublocation_TrainingSet-DNABERT, and mRNA_sublocation_TrainingSet_EIIP contain the code for sequence feature extraction; the remaining IPYNB files train models that combine various features with different base classifiers.

4. Before using mRNA_sublocation_TrainingSet_DNABERT_data.csv and mRNA_sublocation_TrainingSet_NC-BERTdata.csv, extract the CSV files from the mRNA_sublocation_TrainingSet_DNABERT_data.zip and mRNA_sublocation_TrainingSet_NC-BERTdata.zip archives into the data/TrainingSet folder.
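The extraction step in item 4 can be scripted. The sketch below is a minimal example that assumes the two archives already sit inside data/TrainingSet and that each CSV is stored at the top level of its archive; neither detail is stated above, so adjust the paths if the layout differs. The function name `extract_training_archives` is hypothetical.

```python
import zipfile
from pathlib import Path

# Archive names taken from item 4 of the guiding principles.
ARCHIVES = [
    "mRNA_sublocation_TrainingSet_DNABERT_data.zip",
    "mRNA_sublocation_TrainingSet_NC-BERTdata.zip",
]


def extract_training_archives(training_dir, archive_names=ARCHIVES):
    """Extract each archive found in training_dir into training_dir.

    Returns the names of all extracted members; missing archives are skipped.
    Assumption: the archives are located in training_dir itself.
    """
    training_dir = Path(training_dir)
    extracted = []
    for name in archive_names:
        archive_path = training_dir / name
        if not archive_path.is_file():
            print(f"skipping {archive_path}: not found")
            continue
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(training_dir)
            extracted.extend(zf.namelist())
    return extracted


if __name__ == "__main__":
    # Run from the repository root.
    print(extract_training_archives("data/TrainingSet"))
```

Running it from the repository root prints the extracted file names, which should include the two CSV files named in item 4.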

Note:

This code is for the article 'mRCat: a Novel CatBoost Predictor for Binary Classification of mRNA Subcellular Localization by Fusing Large Language Model Representation and Sequence Features'.

Files

README.md

Files (72.6 MB)

md5:c01d3933ab71e03d0e519bb3e1a1415e (4.2 kB)
md5:20cced7946c8421bdf9f3129f5d860fb (72.6 MB)

Additional details

Related works