Deep Learning code for "CNSistent integration and feature extraction from somatic copy number profiles"

Streck, Adam

doi:10.5281/zenodo.14546762

Published December 23, 2024 | Version v1

Computational notebook Open

Streck, Adam (Researcher)¹

1. University Hospital Cologne

Contributors

Supervisor:

Schwarz, Roland¹

1. University Hospital Cologne

Description

Accompanying code for the article "CNSistent integration and feature extraction from somatic copy number profiles". This repository contains the machine learning code applied to data produced by the CNSistent tool (https://bitbucket.org/schwarzlab/cnsistent), which is required to generate the source data.

Repository structure

Notebooks

analyze_lung.ipynb: Analyzes the results of the lung classification and displays misclassified samples.
eval_models.ipynb: Reads training results from res folder to compare models on individual datasets.
eval_features.ipynb: Reads training results from res folder and produces summary tables and plots.
integrated_gradients_bins.ipynb: Reads a model and calculates integrated gradients using the Captum library on fixed-size bins.
integrated_gradients_genes.ipynb: Reads a model and calculates integrated gradients using the Captum library on the COSMIC gene set.
fit_random_tree_forest.ipynb: Fits a random forest model on the output of the dataset for comparison.
fit_umap.ipynb: Fits UMAP on the output of the dataset for comparison.
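
For readers unfamiliar with integrated gradients, the following is a minimal, dependency-free sketch of the idea behind the two integrated_gradients notebooks. It is not the notebooks' code and does not use Captum: `toy_model`, the input, and the all-zero baseline are hypothetical, and gradients are approximated by central differences rather than autograd. The attribution of each feature is the input-minus-baseline difference multiplied by the average gradient along the straight path from baseline to input.

```python
from typing import Callable, List, Sequence


def numerical_gradient(f: Callable[[Sequence[float]], float],
                       x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Approximate the gradient of f at x with central differences."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f: Callable[[Sequence[float]], float],
                         x: Sequence[float], baseline: Sequence[float],
                         steps: int = 200) -> List[float]:
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_model(v: Sequence[float]) -> float:
    """Hypothetical scoring function standing in for a trained network."""
    return v[0] ** 2 + 3.0 * v[1] + v[0] * v[2]


x = [2.0, 1.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `toy_model(x) - toy_model(baseline)`.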

Code

cncc/models.py: Contains the PyTorch model definitions.
cncc/torch_utils.py: Functions for seeding, training, testing, and scoring PyTorch runs.
cncc/utils.py: Functions for IO and data formatting.
train.py: Main training script, see below for use.
test.py: Main testing script, see below for use.

Other

cncc.yaml: The conda environment used to run this repository.
train_all.ps1: Batch script to run all ML processes used in the article.

Results

The `res` folder contains results of all optimizations with two files per run:

`*_cm.tsv` is the resulting confusion matrix, analyzed in the eval_features.ipynb notebook.
`*_res.tsv` contains the optimization statistics, analyzed in the eval_models.ipynb notebook.
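
As an illustration of how a confusion matrix file might be summarized outside the notebooks, here is a small sketch using only the standard library. The file layout is an assumption, not something stated in this description: it presumes a tab-separated square matrix with predicted classes in the header row and true classes in the first column. The class names and counts in `TOY_CM_TSV` are invented for the example and are not results from the repository.

```python
import csv
import io

# Invented toy data; rows are assumed to be true classes, columns predicted.
TOY_CM_TSV = "\tA\tB\tC\nA\t8\t1\t1\nB\t2\t6\t2\nC\t0\t1\t9\n"


def read_confusion_matrix(handle):
    """Parse a labeled TSV confusion matrix into nested dictionaries."""
    reader = csv.reader(handle, delimiter="\t")
    labels = next(reader)[1:]  # skip the empty corner cell
    matrix = {}
    for row in reader:
        if not row:
            continue
        matrix[row[0]] = {label: float(v) for label, v in zip(labels, row[1:])}
    return labels, matrix


def accuracy(labels, matrix):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in labels)
    return correct / total if total else 0.0


def per_class_recall(labels, matrix):
    """Diagonal count divided by the row total for each true class."""
    recall = {}
    for label in labels:
        row_total = sum(matrix[label].values())
        recall[label] = matrix[label][label] / row_total if row_total else 0.0
    return recall


labels, matrix = read_confusion_matrix(io.StringIO(TOY_CM_TSV))
overall = accuracy(labels, matrix)
recall = per_class_recall(labels, matrix)
print(overall, recall)
```

To apply this to a real file, replace the `io.StringIO` object with `open(path, newline="")` after checking that the file actually follows the assumed layout.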

Usage

python train.py

options:

--feature FEATURE Feature type (default: 1MB)

--dataset DATASET Dataset name (default: all), options: ['PCAWG', 'TRACERx', 'TCGA_hg19', 'all']

--selection SELECTION Type selection (default: top_6), options: ['lung', 'top_{i}']

--model MODEL Model type (default: conv)

--seed SEED Random seed (default: 0)

--folds K Number of folds for K-fold cross-validation (default: 5)

--batch BATCH Batch size (default: 64)

--subsample bool Whether or not to subsample the dataset (default: False)
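
To make the roles of `--folds` and `--seed` concrete, the sketch below shows one common way to build seeded K-fold splits with the standard library. It is an assumption about the general technique only: the helper `kfold_indices` is hypothetical, and train.py may split differently (for example with stratification or a library routine).

```python
import random


def kfold_indices(n_samples, folds=5, seed=0):
    """Return (train, test) index lists for each of the K folds."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one sample.
    sizes = [n_samples // folds + (1 if i < n_samples % folds else 0)
             for i in range(folds)]
    splits = []
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        splits.append((train_idx, test_idx))
        start += size
    return splits


splits = kfold_indices(n_samples=23, folds=5, seed=0)
for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, and rerunning with the same seed reproduces the same partition, which is what makes results from different runs comparable.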

python test.py

options:

--feature FEATURE Feature type (default: 1MB)

--train TRAIN Training dataset name (if different from test) (default: )

--test TEST Evaluation dataset name (default: "all")

--selection SELECTION Type selection (default: top_6), options: ['lung', 'top_{i}']

--model MODEL Model type (default: conv)

--seed SEED Random seed (default: 0)

--folds K Number of folds for K-fold cross-validation (default: 5)

--batch BATCH Batch size (default: 64)

--subsample bool Whether or not to subsample the dataset (default: False)
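
The options above can be combined into explicit command lines. The sketch below composes a training and a cross-dataset testing command using only flags listed in this section; the helper `build_command` is hypothetical, the chosen option values are arbitrary examples, and it is an assumption that test.py expects the same feature, selection, model, and seed settings as the corresponding training run. The `--subsample` flag is left out because its accepted value format is not documented here.

```python
import shlex


def build_command(script, **options):
    """Turn keyword options into a 'python script --name value' argument list."""
    command = ["python", script]
    for name, value in options.items():
        if value is None:
            continue  # fall back to the script's own default
        command += [f"--{name}", str(value)]
    return command


# Train on one dataset ...
train_cmd = build_command(
    "train.py", feature="1MB", dataset="TCGA_hg19", selection="lung",
    model="conv", seed=0, folds=5, batch=64,
)
# ... then evaluate that model on a different dataset.
test_cmd = build_command(
    "test.py", feature="1MB", train="TCGA_hg19", test="PCAWG",
    selection="lung", model="conv", seed=0,
)

print(shlex.join(train_cmd))
print(shlex.join(test_cmd))
```

The argument lists can be passed to `subprocess.run(train_cmd, check=True)` from the repository root to execute the scripts.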

Files

project-cnsistent.zip (4.7 MB), md5:5abe0876c9b062597617411bfa540f18
