Published April 5, 2026 | Version v1
Journal article Open

Predicting isocitrate dehydrogenase mutation status in acute myeloid leukemia from gene expression profiles by machine learning

  • 1. Université Grenoble Alpes, INSERM U1209, CNRS UMR 5309, Institute for Advanced Biosciences, Grenoble, France
  • 2. Université Grenoble Alpes, CHU Grenoble Alpes, Department of Hematology, Grenoble, France

Description

This repository accompanies the research article:

Predicting isocitrate dehydrogenase mutation status in acute myeloid leukemia from gene expression profiles by machine learning.

It provides the harmonized transcriptomic data, mutation annotations, predicted labels, and Python scripts required to reproduce the analyses and train the machine-learning models described in the study.

Abstract

We developed machine-learning models to predict isocitrate dehydrogenase (IDH) mutation status in acute myeloid leukemia (AML) from gene expression profiles and to reconstruct missing IDH annotations across public datasets. Transcriptomic data from 19 cohorts (5844 samples) were harmonized using batch correction, and 1546 samples with known IDH status were used to train a feed-forward neural network and a logistic regression (LR) classifier within a nested cross-validation framework, followed by independent validation in TCGA-LAML. The LR model showed superior performance, achieving ROC-AUC = 0.994 ± 0.007, accuracy = 0.983 ± 0.006, balanced accuracy = 0.979 ± 0.005, sensitivity for the IDH-mutant class = 0.972 ± 0.010, and specificity = 0.986 ± 0.008, and correctly classified all IDH-mutant cases in the independent cohort. Applying the final model to samples lacking annotations enabled reconstruction of IDH status for 4148 AML cases, expanding the number of molecularly characterized transcriptomes available for downstream analyses. Predicted groups recapitulated known IDH-associated transcriptional signatures, supporting biological validity. This work demonstrates that IDH mutation status can be accurately inferred from transcriptomic data alone and provides a scalable framework to recover missing genomic annotations, thereby enhancing the utility of public AML resources for large-scale biological and translational research. 

Technical info

📌 Overview

Isocitrate dehydrogenase (IDH) mutations are key molecular events in acute myeloid leukemia (AML), but mutation annotations are often missing in public transcriptomic datasets.

This project presents a scalable computational framework that:

* Integrates and batch-corrects transcriptomic data from 19 AML cohorts (5,844 samples)
* Trains machine-learning models (logistic regression and neural network) to predict IDH mutation status
* Reconstructs missing IDH annotations for 4,148 previously unannotated samples
* Provides predicted labels and reproducible code for community reuse

The final logistic regression model achieves high discrimination performance (ROC-AUC = 0.994 ± 0.007 in nested cross-validation) and is used to generate the predicted annotations included in this repository.
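As a reading aid for the performance figures reported in the abstract, the sketch below shows how accuracy, sensitivity, specificity, and balanced accuracy relate to confusion-matrix counts, taking IDH-MUT as the positive class. The function name and the example counts are hypothetical and are not taken from the repository scripts.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute metrics from confusion-matrix counts (IDH-MUT = positive class)."""
    sensitivity = tp / (tp + fn)   # fraction of IDH-MUT samples detected
    specificity = tn / (tn + fp)   # fraction of IDH-WT samples detected
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=180, fp=20)

# Consistency check on the reported means: balanced accuracy is the
# average of sensitivity (0.972) and specificity (0.986).
reported_balanced = (0.972 + 0.986) / 2
print(round(reported_balanced, 3))  # 0.979
```

The last line reproduces the reported balanced accuracy of 0.979 from the reported sensitivity and specificity.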

 

📌 Data Description

1. Pooled Expression Dataset

File: expression_data_pooled_19_AML_datasets_pycombat_corrected.csv

Description: Batch-corrected gene expression matrix obtained by integrating 19 AML transcriptomic cohorts using pyComBat.

Details:

* Samples: 5,844
* Genes: 9,870
* Format: rows = samples, columns = genes
* Purpose: input features for machine-learning models
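A minimal loading sketch with pandas, run here on a small in-memory stand-in so that it is self-contained. It assumes that the first column of the CSV holds the sample identifiers, which is not stated above; the gene names are placeholders.

```python
import io

import pandas as pd

# Toy stand-in for expression_data_pooled_19_AML_datasets_pycombat_corrected.csv.
# Assumption: first column = sample identifiers; gene names are hypothetical.
toy_csv = io.StringIO(
    "sample,GENE_A,GENE_B,GENE_C\n"
    "S1,5.1,7.3,2.0\n"
    "S2,4.8,6.9,2.4\n"
)

# Rows = samples, columns = genes, as described above.
expr = pd.read_csv(toy_csv, index_col=0)
n_samples, n_genes = expr.shape
print(n_samples, n_genes)

# Feature matrix for a machine-learning model.
X = expr.to_numpy(dtype=float)
```

For the real file, replace `toy_csv` with the file path. If the layout assumption holds, the shape should be 5,844 samples by 9,870 genes.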

2. IDH Mutation Status Annotations

File: idh_mutation_status.csv

Description: Ground-truth IDH mutation labels for samples where annotations are available.

Columns:

`idh_status` → categorical label

  * `IDH-WT`
  * `IDH-MUT`
  * `None` (unknown)  

`idh_mutant` → binary encoding

  * `0` = IDH-WT
  * `1` = IDH-MUT
  * `None` = unknown

Coverage: Available for 1,696 out of 5,844 samples
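Because unknown labels are stored as the string `None`, it can help to map them to missing values explicitly when reading the file. The sketch below uses a toy in-memory table; the name and position of the sample identifier column are assumptions.

```python
import io

import pandas as pd

# Toy stand-in for idh_mutation_status.csv.
# Assumption: first column = sample identifiers (column name is hypothetical).
toy_csv = io.StringIO(
    "sample,idh_status,idh_mutant\n"
    "S1,IDH-WT,0\n"
    "S2,IDH-MUT,1\n"
    "S3,None,None\n"
)

# Treat the literal string "None" as a missing value in every column.
labels = pd.read_csv(toy_csv, index_col=0, na_values=["None"])

# Samples with a known status, usable as training targets.
labeled = labels.dropna(subset=["idh_mutant"])
y = labeled["idh_mutant"].astype(int)

# Samples whose status has to be predicted.
unlabeled_ids = labels.index[labels["idh_mutant"].isna()]
```

On the real file, `len(y)` would be expected to equal the 1,696 annotated samples stated above.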

3. Predicted IDH Status

File: predicted_IDH_status_by_LR_for_19_AML_datasets.csv
 
Description: Final IDH status assignments produced by the logistic regression model.
 
Labels:
 
* `IDH-MUT` → confirmed mutant
* `IDH-WT` → confirmed wildtype
* `pIDH-MUT` → predicted mutant
* `pIDH-WT` → predicted wildtype
 
Columns:
 
* `Sample`
* `Dataset`
* `Known IDH status`
* `Predicted IDH status`
 
This file provides the reconstructed annotations for all 19 datasets.
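The four labels encode two pieces of information: the IDH status itself and whether it was known or predicted. A small helper such as the following can separate them for downstream analyses. The helper name and the example list are hypothetical.

```python
from collections import Counter


def split_label(label: str) -> tuple[str, bool]:
    """Return (status, is_predicted) for one of the four labels."""
    is_predicted = label.startswith("p")
    status = label[1:] if is_predicted else label
    if status not in {"IDH-MUT", "IDH-WT"}:
        raise ValueError(f"unexpected label: {label!r}")
    return status, is_predicted


# Hypothetical example labels, for illustration only.
final_labels = ["IDH-MUT", "pIDH-WT", "pIDH-MUT", "IDH-WT", "pIDH-WT"]
counts = Counter(split_label(label) for label in final_labels)
print(counts[("IDH-WT", True)])  # 2 predicted wildtype samples
```

Keeping the known/predicted flag separate makes it easy to restrict an analysis to confirmed cases when needed.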
 

📌 Machine Learning Models

Two supervised models are implemented:

Logistic Regression (LR)

* Final selected model
* L2 regularization
* High interpretability and robustness
* Used for final predictions in this repository
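To illustrate why an L2-regularized logistic regression is easy to interpret, the sketch below fits one with scikit-learn on synthetic data and ranks features by the magnitude of their coefficients. The use of scikit-learn, the feature scaling step, and the value of `C` are assumptions made for this example, not the settings of the repository scripts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only features 0 and 1 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

# LogisticRegression applies an L2 penalty by default; C is the inverse
# regularization strength (1.0 is an arbitrary choice here).
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(X, y)

# One coefficient per feature: sign gives direction, magnitude gives weight.
coefs = model.named_steps["logisticregression"].coef_.ravel()
top = np.argsort(np.abs(coefs))[::-1][:2]
print(top)
```

With gene expression features, the same ranking would point to the genes that contribute most to the predicted IDH status.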
 

Neural Network (NN)

* Feed-forward multilayer perceptron
* Included for comparison and reproducibility
 

📌 Code Description

Hyperparameter Optimization

grid_search_LR.py
 
Performs a nested cross-validation grid search to identify optimal logistic regression hyperparameters.
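The general pattern of a nested cross-validation grid search can be sketched as follows with scikit-learn. This is an illustration on synthetic data: the grid over `C`, the fold counts, and the ROC-AUC scoring are assumptions, and the actual settings are those in grid_search_LR.py.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

# Inner loop selects hyperparameters; outer loop estimates performance
# on folds that were never seen during that selection.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer fold refits the whole grid search on its training part.
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"ROC-AUC = {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

Reporting the mean and standard deviation over the outer folds gives figures in the same "value ± spread" form as those in the abstract.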
 
grid_search_NN.py
 
Performs a hyperparameter search over the neural network architecture and training parameters.

Model Training and Prediction

train_LR.py
 
* Trains the logistic regression model using selected hyperparameters
* Generates predictions for new samples
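The train-then-predict workflow can be sketched as below: fit on samples with a known status, then assign `pIDH-MUT` or `pIDH-WT` to the remaining samples. The data are synthetic, and the 0.5 probability threshold and all variable names are assumptions made for this example, not a description of train_LR.py.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic expression matrix: 120 samples, 5 hypothetical genes.
rng = np.random.default_rng(0)
samples = [f"S{i}" for i in range(120)]
genes = [f"G{j}" for j in range(5)]
expr = pd.DataFrame(rng.normal(size=(120, 5)), index=samples, columns=genes)

# Known status for the first 80 samples only; NaN marks unknown.
truth = (expr["G0"] > 0).astype(int)
known = pd.Series(np.nan, index=samples)
known.iloc[:80] = truth.iloc[:80].to_numpy()
is_labeled = known.notna()

# Train on the labeled subset.
model = LogisticRegression(max_iter=1000)
model.fit(expr[is_labeled], known[is_labeled].astype(int))

# Predicted labels carry a "p" prefix; known labels are kept as they are.
proba = model.predict_proba(expr)[:, 1]
predicted = pd.Series(np.where(proba >= 0.5, "pIDH-MUT", "pIDH-WT"), index=samples)
known_label = known.map({1.0: "IDH-MUT", 0.0: "IDH-WT"})
final = known_label.where(is_labeled, predicted)
print(final.value_counts())
```

The resulting `final` series uses the same four labels as the predicted status file described above.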
 
train_NN.py
 
* Trains the neural network model
* Outputs probability predictions and class labels
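For readers who want a quick comparison model, a feed-forward network that outputs both probabilities and class labels can be sketched with scikit-learn's `MLPClassifier`. This is a stand-in: the framework, architecture, and training settings used in train_NN.py are not described here, and the single hidden layer of 16 units and the 0.5 threshold below are arbitrary choices.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Small multilayer perceptron (hypothetical architecture).
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X[:150], y[:150])

# Probability of the positive class, then thresholded class labels.
proba = mlp.predict_proba(X[150:])[:, 1]
labels = (proba >= 0.5).astype(int)
print(proba[:3], labels[:3])
```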

Files

Jung_et_al.zip

Size: 492.2 MB
md5:f6fed26d23e03b97dd796088bccc64bf

Additional details

Software

Repository URL
https://github.com/epimed/aml-idh-predict
Programming language
Python