There is a newer version of the record available.

Published December 23, 2024 | Version v1
Computational notebook Open

Deep Learning code for "CNSistent integration and feature extraction from somatic copy number profiles"

  • 1. University Hospital Cologne

Contributors

Supervisor:

  • 1. University Hospital Cologne

Description

Accompanying code for the article "CNSistent integration and feature extraction from somatic copy number profiles". This repository contains the machine learning code that operates on data produced by the CNSistent tool (https://bitbucket.org/schwarzlab/cnsistent). CNSistent is required to generate the source data.

Repository structure

Notebooks

  • analyze_lung.ipynb: Analyzes the results of the lung classification and displays misclassified samples.
  • eval_models.ipynb: Reads training results from res folder to compare models on individual datasets.
  • eval_features.ipynb: Reads training results from res folder and produces summary tables and plots.
  • integrated_gradients_bins.ipynb: Reads a model and calculates integrated gradients using the CAPTUM library on fixed size bins.
  • integrated_gradients_genes.ipynb: Reads a model and calculates integrated gradients using the CAPTUM library on COSMIC gene set.
  • fit_random_tree_forest.ipynb: Fits a random forest model on the output of the dataset for comparison.
  • fit_umap.ipynb: Fits UMAP on the output of the dataset for comparison.
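
For readers unfamiliar with integrated gradients, the idea behind the two integrated_gradients notebooks can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use Captum or PyTorch, `toy_model` is a hypothetical scalar function rather than one of the repository's networks, and gradients are approximated numerically instead of by autograd. In the notebooks each input would correspond to a bin or a gene.

```python
from typing import Callable, List, Sequence


def numeric_gradient(f: Callable[[Sequence[float]], float],
                     x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Central-difference gradient of f at x (a stand-in for autograd)."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def integrated_gradients(f: Callable[[Sequence[float]], float],
                         x: Sequence[float], baseline: Sequence[float],
                         steps: int = 200) -> List[float]:
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numeric_gradient(f, point)):
            total[i] += g
    # Scale the averaged gradient by the distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_model(v: Sequence[float]) -> float:
    # Hypothetical model used only for this illustration.
    return v[0] ** 2 + 3.0 * v[1] + v[0] * v[2]


x = [2.0, 1.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `toy_model(x) - toy_model(baseline)`.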

Code

  • cncc/models.py: Contains the PyTorch model definitions.
  • cncc/torch_utils.py: Functions for seeding, training, testing, and scoring PyTorch runs.
  • cncc/utils.py: Functions for IO and data formatting.
  • train.py: Main training script, see below for use.
  • test.py: Main testing script, see below for use.

Other

  • cncc.yaml: The conda environment used to run this repository.
  • train_all.ps1: PowerShell batch script to run all ML processes used in the article.

Results

The `res` folder contains results of all optimizations with two files per run: 

  • `*_cm.tsv` is the resulting confusion matrix, analyzed in the eval_features.ipynb notebook.
  • `*_res.tsv` is the optimization statistic, analyzed in the eval_models.ipynb notebook.
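
As a rough illustration of how a `*_cm.tsv` file could be summarized outside the notebooks, the sketch below reads a tab-separated confusion matrix and computes overall accuracy and per-class recall. The file layout is an assumption, not something stated by the repository: a header row of predicted labels, one row per true label, and integer counts. The class names and counts in `example` are made up for the illustration.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_confusion_matrix(handle: TextIO) -> Tuple[List[str], List[List[int]]]:
    """Read a TSV with a header row and a leading label column (assumed layout)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    labels = header[1:]
    matrix = [[int(v) for v in row[1:]] for row in reader if row]
    return labels, matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


def per_class_recall(labels: List[str], matrix: List[List[int]]) -> Dict[str, float]:
    """Diagonal count divided by the row sum, assuming rows are true labels."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else 0.0
    return recall


# Made-up example data; real files live in the res folder.
example = "\ttype_A\ttype_B\ntype_A\t8\t2\ntype_B\t1\t9\n"
labels, matrix = read_confusion_matrix(io.StringIO(example))
overall = accuracy(matrix)
recall = per_class_recall(labels, matrix)
```

To use this on a real result, replace the `io.StringIO` object with an opened file from `res` and adjust the parsing if the actual layout differs.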

Usage

python train.py

options:
  --feature FEATURE     Feature type (default: 1MB)
  --dataset DATASET     Dataset name (default: all), options: ['PCAWG', 'TRACERx', 'TCGA_hg19', 'all']
  --selection SELECTION Type selection (default: top_6), options: ['lung', 'top_{i}']
  --model MODEL         Model type (default: conv)
  --seed SEED           Random seed (default: 0)
  --folds K             Number of folds for K-fold cross-validation (default: 5)
  --batch BATCH         Batch size (default: 64)
  --subsample bool      Whether or not to subsample the dataset (default: False)
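
The option list above can be mirrored with a small `argparse` parser, which may help when scripting runs or checking arguments before launching a job. This is a sketch reconstructed from the help text only and is not the parser used in train.py; `build_parser` and `str_to_bool` are hypothetical helper names, and the accepted boolean spellings are an assumption.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse common boolean spellings (plain bool() would treat 'False' as True)."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented train.py options."""
    parser = argparse.ArgumentParser(
        prog="train.py",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("--feature", default="1MB", help="Feature type")
    parser.add_argument("--dataset", default="all",
                        choices=["PCAWG", "TRACERx", "TCGA_hg19", "all"],
                        help="Dataset name")
    parser.add_argument("--selection", default="top_6",
                        help="Type selection: 'lung' or 'top_{i}'")
    parser.add_argument("--model", default="conv", help="Model type")
    parser.add_argument("--seed", type=int, default=0, help="Random seed")
    parser.add_argument("--folds", type=int, default=5, metavar="K",
                        help="Number of folds for K-fold cross-validation")
    parser.add_argument("--batch", type=int, default=64, help="Batch size")
    parser.add_argument("--subsample", type=str_to_bool, default=False,
                        metavar="bool",
                        help="Whether or not to subsample the dataset")
    return parser


# Equivalent to: python train.py --dataset PCAWG --selection lung
args = build_parser().parse_args(["--dataset", "PCAWG", "--selection", "lung"])
```

Options that are not given on the command line keep the defaults listed above, so `args.feature` is `"1MB"` and `args.folds` is `5` in this call.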

python test.py

options:
  --feature FEATURE     Feature type (default: 1MB)
  --train TRAIN         Training dataset name (if different from test) (default: )
  --test TEST           Evaluation dataset name (default: "all")
  --selection SELECTION Type selection (default: top_6), options: ['lung', 'top_{i}']
  --model MODEL         Model type (default: conv)
  --seed SEED           Random seed (default: 0)
  --folds K             Number of folds for K-fold cross-validation (default: 5)
  --batch BATCH         Batch size (default: 64)
  --subsample bool      Whether or not to subsample the dataset (default: False)
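
Both scripts accept `--folds K` and `--seed SEED`. As a reminder of what K-fold cross-validation does with these two values, here is a minimal pure-Python sketch of a seeded, shuffled K-fold split. It is an illustration only: the repository may split differently (for example with stratification by class), and `k_fold_indices` is a hypothetical helper, not a function from cncc.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # A seeded shuffle makes the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


folds = list(k_fold_indices(n_samples=23, k=5, seed=0))
```

Every sample is held out exactly once across the folds, and rerunning with the same seed reproduces the same partition.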

Files

project-cnsistent.zip (4.7 MB)
md5:5abe0876c9b062597617411bfa540f18