Code and data for "Accurate prediction of gene deletion phenotypes with Flux Cone Learning"
Authors/Creators
- 1. The University of Edinburgh
Contributors
Contact persons:
Project member:
Description
Description
Code and data for paper "Accurate prediction of gene deletion phenotypes with Flux Cone Learning" by Merzbacher et al, 2025. This repository contains code for training and evaluating machine learning models to predict the effects of gene deletions in different organisms. Files contain the training datasets created by large-scale flux sampling of GEMs (data.zip) and all training scripts and Jupyter notebooks with code for creating all paper figures (deletionprediction-main.zip).
Repository Structure
Environment and Dependencies
The `environment.yml` file contains all dependencies needed to run the code. It can be installed in a fresh Conda environment using the following command: `conda env create -f environment.yml`.
Figure Creation Notebooks
The data and code to create all figures in the paper "Accurate prediction of gene deletion phenotypes with Flux Cone Learning" is organized by figure panel. Some figures require multiple data files; Figures 2A and B are created from the same dataset.
Training Scripts
- `training/ecoli_training.py`: Trains RandomForest models for E. coli essentiality classification using train/test splits. Note that this script trains models on all reactions in a given GEM, not only the shared reactions as in Figure 1F.
- `training/yeast_training_esseniality.py`: Trains RandomForest models for yeast essentiality classification using k-fold cross validation and hyperparameter optimization
- `training/yeast_training_production.py`: Trains multiple ML models (HistGradientBoosting, LinearSVC, LogisticRegression, RandomForest) for yeast production classification with balanced and resampled variants
- `training/cho_training.py`: Trains HistGradientBoosting models for CHO cell essentiality classification using k-fold cross validation and hyperparameter optimization
Data
The training data should be placed in the `data/` directory once unzipped and includes:
- CHO cell data: `cho_essential_full.npz`, `cho_nonessential_full.npz`
- E. coli data: Model-specific files like `iml1515_1x_essential.npz`
- Yeast data: `yeast_single_knockouts.npz`
The `data/` directory also contains several CSV files used for data splits and analysis:
- `groundtruthcomplete.csv`: Includes the ground truth labels for E. coli gene essentiality.
- `yeast_essentiality_test_split.csv`: Defines the test set for yeast data by marking knockouts as test/non-test
- `yeast_production_test_split.csv`: Lists knockout names designated for the test set in yeast production prediction
- `yeast_production_validation_split.csv`: Contains k-fold validation splits for yeast production data, with knockout names and fold assignments
- `cho_essentiality_validation_split.csv`: Contains k-fold validation splits for CHO cell data, with columns for knockout names and fold assignments
- `cho_essentiality_test_split.csv`: Defines the test set for CHO cell data by marking knockouts as test/non-test
Model Parameters
The training scripts accept various command line arguments to configure:
- Number of training repeats (number of times model is trained)
- Number of training folds (number of k-fold splits of training set)
- Test set split percentage
- Model hyperparameters (learning rate, tree depth, etc.)
- Paths for saving models and results
Usage
Example usage for training an E. coli model:
Basic usage with default parameters
`python ecoli_training.py --model 'iml1515' --savepath 'results/' --repeats 5 --test_split 0.2`
The script will:
1. Load E. coli knockout data for iml1515 model
2. Split data into train/test sets (20% test)
3. Train a RandomForest model 5 times with different random splits
4. Save models and results to models/ directory
Example usage for training a yeast essentiality model:
Basic usage with default parameters:
`python yeast_training_essentiality.py --savepath 'results/'`
Full parameter specification:
```shell
python yeast_training_essentiality.py \
--savepath 'results/' \
--repeats 5 \
--folds 5 \
--grid_search True \
--max_depth 10 \
--n_estimators 200 \
--min_samples_split 5
```
The script will:
1. Load yeast knockout data
2. Split data into train/test sets based on predefined splits
3. Train RandomForest models with specified parameters
4. Evaluate using k-fold cross validation
5. Save models and results to specified path
For hyperparameter tuning:
```shell
python yeast_training_essentiality.py \
--savepath 'tuning/' \
--grid_search True \
--repeats 3
```
Example usage for training a yeast production model suite:
Basic usage with default parameters:
`python yeast_training_production.py --savepath 'results/'`
Full parameter specification:
```shell
python yeast_training_production.py \
--savepath 'results/' \
--repeats 5 \
--folds 5 \
--test_split 0.2
```
The script will:
1. Load yeast knockout data and production values
2. Preprocess data by removing NaNs and scaling production values
3. Bin production into 3 classes (low/medium/high)
4. Train multiple ML models with k-fold cross validation
5. Save models and results to specified path
Example usage for training a CHO model:
Basic usage with default parameters
`python cho_training.py --savepath 'results/' --max_iter 100 --learning_rate 0.1`
Full parameter specification
```shell
python cho_training.py \
--savepath 'results/' \
--repeats 5 \
--test_split 0.2 \
--num_samples 100 \
--downsample 1 \
--max_depth 10 \
--max_iter 100 \
--learning_rate 0.1
```
For hyperparameter tuning
```shell
python cho_training.py \
--savepath tuning/ \
--max_depth 5 \
--max_iter 50 \
--learning_rate 0.05
```
The script will:
1. Load CHO cell knockout data
2. Split data into train/test sets
3. Train a HistGradientBoosting model with specified parameters
4. Evaluate using k-fold cross validation
5. Save model and results to specified path
Technical info
Version 2 includes code to produce essentiality ROC curves from FBA and FCL predictions.
Files
data.zip
Files
(23.2 GB)
| Name | Size | Download all |
|---|---|---|
|
md5:e6f0f188ec032f2cee662a48f8572576
|
23.2 GB | Preview Download |
|
md5:f438835741dfc0842fd7f2c432fdc823
|
49.2 MB | Preview Download |
Additional details
Software
- Programming language
- Python