Predicting isocitrate dehydrogenase mutation status in acute myeloid leukemia from gene expression profiles by machine learning
Authors/Creators
- 1. Université Grenoble Alpes, INSERM U1209, CNRS UMR 5309, Institute for Advanced Biosciences, Grenoble, France
- 2. Université Grenoble Alpes, CHU Grenoble Alpes, Department of Hematology, Grenoble, France
Description
Abstract
We developed machine-learning models to predict isocitrate dehydrogenase (IDH) mutation status in acute myeloid leukemia (AML) from gene expression profiles and to reconstruct missing IDH annotations across public datasets. Transcriptomic data from 19 cohorts (5844 samples) were harmonized using batch correction, and 1546 samples with known IDH status were used to train a feed-forward neural network and a logistic regression (LR) classifier within a nested cross-validation framework, followed by independent validation in TCGA-LAML. The LR model showed superior performance, achieving ROC-AUC = 0.994 ± 0.007, accuracy = 0.983 ± 0.006, balanced accuracy = 0.979 ± 0.005, sensitivity for the IDH-mutant class = 0.972 ± 0.010, and specificity = 0.986 ± 0.008, and correctly classified all IDH-mutant cases in the independent cohort. Applying the final model to samples lacking annotations enabled reconstruction of IDH status for 4148 AML cases, expanding the number of molecularly characterized transcriptomes available for downstream analyses. Predicted groups recapitulated known IDH-associated transcriptional signatures, supporting biological validity. This work demonstrates that IDH mutation status can be accurately inferred from transcriptomic data alone and provides a scalable framework to recover missing genomic annotations, thereby enhancing the utility of public AML resources for large-scale biological and translational research.
Technical info
📌 Overview
Isocitrate dehydrogenase (IDH) mutations are key molecular events in acute myeloid leukemia (AML), but mutation annotations are often missing in public transcriptomic datasets.
This project presents a scalable computational framework that:
* Integrates and batch-corrects transcriptomic data from 19 AML cohorts (5,844 samples)
* Trains machine-learning models (logistic regression and neural network) to predict IDH mutation status
* Reconstructs missing IDH annotations for 4,148 previously unannotated samples
* Provides predicted labels and reproducible code for community reuse
The final logistic regression model achieves high discrimination performance and is used to generate the predicted annotations included in this repository.
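The nested cross-validation idea behind the logistic regression evaluation can be sketched as follows. This is a minimal illustration on synthetic data using scikit-learn, not the authors' configuration: the fold counts, the `C` grid, the standardization step, the `class_weight="balanced"` setting, and the helper name `nested_cv_auc` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def nested_cv_auc(X, y, c_grid=(0.01, 0.1, 1.0), seed=0):
    """Return one ROC-AUC score per outer fold (hypothetical helper).

    The inner loop selects the regularization strength C; the outer loop
    scores the tuned model on data never seen during tuning.
    """
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Assumed preprocessing and class weighting, chosen only for illustration.
    pipeline = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    search = GridSearchCV(
        pipeline,
        param_grid={"logisticregression__C": list(c_grid)},
        scoring="roc_auc",
        cv=inner,
    )
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer)


# Synthetic, imbalanced stand-in for an expression matrix with binary labels.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
scores = nested_cv_auc(X, y)
print(f"ROC-AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```

Reporting the mean and standard deviation over the outer folds mirrors the "value ± spread" form used for the metrics in the abstract; the actual procedure is in the linked repository.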
📌 Data Description
1. Pooled Expression Dataset
File: expression_data_pooled_19_AML_datasets_pycombat_corrected.csv
Description: Batch-corrected gene expression matrix obtained by integrating 19 AML transcriptomic cohorts using pyComBat.
Details:
* Samples: 5,844
* Genes: 9,870
* Format: rows = samples, columns = genes
* Purpose: input features for machine-learning models
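As a rough sketch, a matrix laid out this way (rows = samples, columns = genes) could be read with pandas as shown below. The example assumes that the first CSV column holds the sample identifiers, which the record does not state; the column name `sample_id`, the gene names, and the helper `load_expression` are hypothetical, and a tiny in-memory CSV stands in for the real file.

```python
import io

import pandas as pd


def load_expression(source):
    """Read a samples x genes matrix (hypothetical helper).

    Assumes the first column contains sample identifiers and every
    remaining column is one gene.
    """
    return pd.read_csv(source, index_col=0)


# Tiny stand-in for the real file; identifiers and values are made up.
demo_csv = io.StringIO(
    "sample_id,GENE_A,GENE_B,GENE_C\n"
    "S1,1.5,2.0,0.1\n"
    "S2,0.7,3.2,1.4\n"
)
expr = load_expression(demo_csv)
n_samples, n_genes = expr.shape
print(n_samples, "samples x", n_genes, "genes")
```

For the real dataset, the same function would be called with the path to `expression_data_pooled_19_AML_datasets_pycombat_corrected.csv`, and the resulting shape can be checked against the sample and gene counts listed above.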
2. IDH Mutation Status Annotations
File: idh_mutation_status.csv
Description: Ground-truth IDH mutation labels for samples where annotations are available.
Columns:
`idh_status` → categorical label
* `IDH-WT`
* `IDH-MUT`
* `None` (unknown)
`idh_mutant` → binary encoding
* `0` = IDH-WT
* `1` = IDH-MUT
* `None` = unknown
Coverage: Available for 1,696 out of 5,844 samples
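One possible way to separate annotated from unannotated samples with these columns is sketched below. It assumes that the label file is indexed by the same sample identifiers as the expression matrix and that unknown values appear as the literal text `None`; the `sample_id` column name, the demo values, and the helper `split_by_annotation` are hypothetical.

```python
import io

import pandas as pd


def split_by_annotation(expr, labels):
    """Split samples by whether `idh_mutant` is known (hypothetical helper).

    Returns (X_labeled, y_labeled, X_unlabeled).
    """
    shared = expr.index.intersection(labels.index)
    # Rows whose label is missing are dropped; the rest become 0/1 integers.
    y_labeled = labels.loc[shared, "idh_mutant"].dropna().astype(int)
    X_labeled = expr.loc[y_labeled.index]
    # Everything without a known label is a candidate for prediction.
    X_unlabeled = expr.loc[~expr.index.isin(y_labeled.index)]
    return X_labeled, y_labeled, X_unlabeled


# Made-up labels; "None" is parsed explicitly as a missing value.
labels_csv = io.StringIO(
    "sample_id,idh_status,idh_mutant\n"
    "S1,IDH-WT,0\n"
    "S2,IDH-MUT,1\n"
    "S3,None,None\n"
)
labels = pd.read_csv(labels_csv, index_col=0, na_values=["None"])

# Made-up expression values; S4 has no row in the label file at all.
expr = pd.DataFrame(
    {"GENE_A": [1.5, 0.7, 2.2, 0.3], "GENE_B": [2.0, 3.2, 1.1, 0.9]},
    index=["S1", "S2", "S3", "S4"],
)

X_labeled, y_labeled, X_unlabeled = split_by_annotation(expr, labels)
print(len(X_labeled), "labeled,", len(X_unlabeled), "unlabeled")
```

Passing `na_values=["None"]` explicitly avoids depending on which strings a given pandas version treats as missing by default.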
3. Predicted IDH Status
📌 Machine Learning Models
Logistic Regression (LR)
Neural Network (NN)
📌 Code Description
Hyperparameter Optimization
Model Training and Prediction
Files
| Name | Size | Checksum |
|---|---|---|
| Jung_et_al.zip | 492.2 MB | md5:f6fed26d23e03b97dd796088bccc64bf |
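After downloading, the archive can be checked against the MD5 checksum listed for `Jung_et_al.zip`. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` and the chunk size are illustrative choices, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed on this record for Jung_et_al.zip.
EXPECTED_MD5 = "f6fed26d23e03b97dd796088bccc64bf"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

A call such as `verify_download("Jung_et_al.zip")` returns `True` only if the local copy matches the listed checksum. Reading in chunks keeps memory use low for a file of this size.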
Additional details
Software
- Repository URL: https://github.com/epimed/aml-idh-predict
- Programming language: Python