Published December 23, 2024
| Version v1
Computational notebook
Open
Deep Learning code for "CNSistent integration and feature extraction from somatic copy number profiles"
Description
Accompanying code for the article "CNSistent integration and feature extraction from somatic copy number profiles". This repository contains the machine learning code that operates on data produced by the CNSistent tool (https://bitbucket.org/schwarzlab/cnsistent), which is required to generate the source data.
Repository structure
Notebooks
- `analyze_lung.ipynb`: Analyzes the results of the lung classification and displays misclassified samples.
- `eval_models.ipynb`: Reads training results from the `res` folder to compare models on individual datasets.
- `eval_features.ipynb`: Reads training results from the `res` folder and produces summary tables and plots.
- `integrated_gradients_bins.ipynb`: Reads a model and calculates integrated gradients using the Captum library on fixed-size bins.
- `integrated_gradients_genes.ipynb`: Reads a model and calculates integrated gradients using the Captum library on the COSMIC gene set.
- `fit_random_tree_forest.ipynb`: Fits a random forest model on the output of the dataset for comparison.
- `fit_umap.ipynb`: Fits UMAP on the output of the dataset for comparison.
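For readers unfamiliar with integrated gradients, the idea behind the two attribution notebooks can be illustrated without any deep learning library. The following is a minimal pure-Python sketch, not the notebooks' code and not the Captum API: the logistic `model`, the `weights`, the four-bin `profile`, and the all-2 (diploid) `baseline` are hypothetical values chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy stand-in for a trained network: logistic score of a weighted sum."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def model_grad(x, w):
    """Analytic gradient of the toy model with respect to each input bin."""
    s = model(x, w)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann approximation of integrated gradients along the
    straight path from the baseline to the input."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point, w)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical values: four bins, copy number 2 assumed as the neutral baseline.
weights = [0.8, -0.5, 0.0, 1.2]
profile = [3.0, 1.0, 2.0, 4.0]
baseline = [2.0, 2.0, 2.0, 2.0]

attributions = integrated_gradients(profile, baseline, weights)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum, up to the approximation error, to the difference between the model output at the input and at the baseline.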
Code
- `cncc/models.py`: Contains the PyTorch model definitions.
- `cncc/torch_utils.py`: Functions for seeding, training, testing, and scoring PyTorch runs.
- `cncc/utils.py`: Functions for IO and data formatting.
- `train.py`: Main training script, see below for use.
- `test.py`: Main testing script, see below for use.
Other
- `cncc.yaml`: The conda environment used to run this repository.
- `train_all.ps1`: Batch script to run all ML processes used in the article.
Results
The `res` folder contains results of all optimizations with two files per run:
- `*_cm.tsv` is the resulting confusion matrix, analyzed in the `eval_features.ipynb` notebook.
- `*_res.tsv` is the optimization statistic, analyzed in the `eval_models.ipynb` notebook.
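As an illustration of how a confusion matrix file might be summarized outside the notebooks, here is a minimal sketch using only the Python standard library. The file layout is an assumption, not something this README specifies: the header row is taken to list predicted labels and the first column of each following row the true label. The `LUAD`/`LUSC` counts are made-up toy values, and the helper names are hypothetical.

```python
import csv
import io


def read_confusion_matrix(handle):
    """Parse a TSV confusion matrix (assumed layout: header row = predicted
    labels, first column = true label) into a nested dict."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    predicted = rows[0][1:]
    return {
        r[0]: dict(zip(predicted, (float(v) for v in r[1:])))
        for r in rows[1:]
    }


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    correct = sum(row.get(label, 0.0) for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total if total else 0.0


def per_class_recall(matrix):
    """Diagonal count divided by the row sum, per true label."""
    recall = {}
    for label, row in matrix.items():
        row_total = sum(row.values())
        recall[label] = row.get(label, 0.0) / row_total if row_total else 0.0
    return recall


# Toy data standing in for a real `*_cm.tsv` file.
example = "\tLUAD\tLUSC\nLUAD\t40\t10\nLUSC\t5\t45\n"
cm = read_confusion_matrix(io.StringIO(example))
print(accuracy(cm))
print(per_class_recall(cm))
```

To apply this to a real result, replace the `io.StringIO` object with an open file handle, after checking that the file actually follows the assumed layout.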
Usage
python train.py
options:
--feature FEATURE Feature type (default: 1MB)
--dataset DATASET Dataset name (default: all), options: ['PCAWG', 'TRACERx', 'TCGA_hg19', 'all']
--selection SELECTION Type selection (default: top_6), options: ['lung', 'top_{i}']
--model MODEL Model type (default: conv)
--seed SEED Random seed (default: 0)
--folds K Number of folds for K-fold cross-validation (default: 5)
--batch BATCH Batch size (default: 64)
--subsample bool Whether or not to subsample the dataset (default: False)
python test.py
options:
--feature FEATURE Feature type (default: 1MB)
--train TRAIN Training dataset name (if different from test) (default: )
--test TEST Evaluation dataset name (default: "all")
--selection SELECTION Type selection (default: top_6), options: ['lung', 'top_{i}']
--model MODEL Model type (default: conv)
--seed SEED Random seed (default: 0)
--folds K Number of folds for K-fold cross-validation (default: 5)
--batch BATCH Batch size (default: 64)
--subsample bool Whether or not to subsample the dataset (default: False)
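The options above can be combined to launch several training runs in sequence. The sketch below builds `train.py` command lines from the documented flags and prints them; it is an illustration only and does not reproduce the contents of `train_all.ps1`. The function names `build_train_commands` and `run_all`, and the chosen datasets and seeds, are hypothetical.

```python
import itertools
import subprocess
import sys


def build_train_commands(datasets, seeds, feature="1MB", model="conv",
                         selection="top_6"):
    """Return one train.py argument list per (dataset, seed) combination,
    using only flags listed in the usage text above."""
    commands = []
    for dataset, seed in itertools.product(datasets, seeds):
        commands.append([
            sys.executable, "train.py",
            "--feature", feature,
            "--dataset", dataset,
            "--selection", selection,
            "--model", model,
            "--seed", str(seed),
        ])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


commands = build_train_commands(["PCAWG", "TCGA_hg19"], seeds=[0, 1])
run_all(commands)  # dry run: prints four command lines without executing them
```

Calling `run_all(commands, dry_run=False)` from the repository root would execute the runs one after another and stop at the first failure, since `check=True` raises on a non-zero exit code.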
Files
project-cnsistent.zip (4.7 MB), md5:5abe0876c9b062597617411bfa540f18