CUPiD, A cfDNA methylation-based tissue-of-origin classifier for Cancers of Unknown Primary - classifier data and code
Description
This repository holds the code behind the article "A cfDNA methylation-based tissue-of-origin classifier for Cancers of Unknown Primary" by Conway, Pearce, Clipson et al., published in Nature Communications. It contains the code and data required to generate the CUPiD classifier itself.
Data and code to reproduce the figures in the paper are available from https://zenodo.org/uploads/10684337 (unrestricted).
Methyl-Binding Domain protein sequencing (MBD-Seq) was applied to circulating cell-free DNA (cfDNA) samples derived from patients with a range of known cancer types (143 patients), as well as 106 non-cancerous controls (NCCs; 79 used in training).
The objects deposited here include R data files containing qseaSets from the R package qsea, which include the read counts per sample per 300 base pair window across the genome, as well as information on copy number variation and metadata tables. These are provided in the inputFiles/nextflowOutput folder, and are some of the outputs of the Nextflow pipeline.
The scripts folder contains numbered sub-folders, with numbered scripts within them, which should be run in order. The scripts are set up to be run on a PBS-Torque system; files ending ".pbs" should be submitted via qsub, and files ending ".sh" should be run on a node and will submit individual jobs within a loop. R scripts without an associated .pbs or .sh file should just be run directly. All files should be submitted from the base of the repository (e.g. qsub scripts/01-downloadData/01-getRawData.pbs) to set the paths appropriately via the environment variable PBS_O_WORKDIR.
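The run rules above can be sketched as a small dry-run helper that prints, in order, the command each script would need. This is only an illustration and not part of the repository: the function names command_for and plan_run are hypothetical, and the use of bash for ".sh" files and Rscript for standalone R scripts is an assumption about what "run on a node" and "run directly" mean in practice.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the command for each script, chosen by file extension.
# Nothing is submitted; the output is meant to be reviewed by hand.

# Map one script path to a command (bash and Rscript are assumptions).
command_for() {
  local script="$1" stem="${1%.*}"
  case "$script" in
    *.pbs) echo "qsub $script" ;;
    *.sh)  echo "bash $script" ;;
    *.R|*.r)
      # An R script with a matching .pbs or .sh wrapper is run via that wrapper.
      if [ -e "$stem.pbs" ] || [ -e "$stem.sh" ]; then
        return 0
      fi
      echo "Rscript $script" ;;
  esac
}

# Walk the numbered sub-folders and their numbered scripts in glob (sorted) order.
plan_run() {
  local dir="${1:-scripts}" f
  for f in "$dir"/*/*; do
    [ -f "$f" ] || continue
    command_for "$f"
  done
}

# Call from the base of the repository so the relative paths match.
plan_run scripts
```

Keeping the paths relative to the repository base mirrors the requirement that jobs are submitted from there, since PBS_O_WORKDIR is set to the directory in which qsub was invoked.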
01-downloadData contains scripts to download and preprocess all the required data.
02-qseaSetNextFlowPipeline contains our custom in-house DSL2 Nextflow pipeline, which takes fastq files through to processed qseaSets, including QC checks. This requires the fastq files, which will be deposited in EGA.
03-convertArrays converts downloaded (pre-processed) arrays into estimated qseaSets (each containing solely the array sample), and then mixes each array with each NCC cfDNA at varying proportions.
04-DMRs calculates pairwise DMRs between each class.
05-prepForClassifier selects up to 10000 mixture sets per class, and generates a large table suitable for input into the ML model.
06-fitClassifier fits the ML model using xgboost within the tidymodels framework. This is repeated 100 times with different subsets of the mixture sets as input.
07-applyClassifier applies these classifiers to the "independent test cohort": the 143 patients with known tumour types, 27 additional NCCs and the 41 patients with CUP. These have been run through the Nextflow pipeline separately from the 79 NCCs used to derive CUPiD, and have not been used to derive the classifier.
08-UMAPs generates some UMAPs on the array data.
A subset of these output files is provided in https://zenodo.org/uploads/10684337, along with the code to reproduce the figures.
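To make the size of the mixture design in 03-convertArrays and 05-prepForClassifier concrete, the sketch below enumerates every array, NCC and proportion combination and then caps the number kept per class. It is a toy illustration under stated assumptions, not the repository's code: the sample names, class labels and proportions are invented, the function names list_mixtures and cap_per_class are hypothetical, and the cap here simply keeps the first rows per class, whereas the actual selection rule is not described in this text.

```shell
#!/usr/bin/env bash
# Toy inputs (all hypothetical): "array:class" labels, NCC names, tumour proportions.
arrays=("array_A:classX" "array_B:classX" "array_C:classY")
nccs=("ncc_1" "ncc_2")
proportions=(0.01 0.05 0.10)

# Print one tab-separated row per combination: class, array, NCC, proportion.
list_mixtures() {
  local a n p
  for a in "${arrays[@]}"; do
    for n in "${nccs[@]}"; do
      for p in "${proportions[@]}"; do
        printf '%s\t%s\t%s\t%s\n' "${a#*:}" "${a%%:*}" "$n" "$p"
      done
    done
  done
}

# Keep at most $1 rows per class (column 1), in input order.
# Assumption: a real pipeline may well sample at random instead.
cap_per_class() {
  awk -F'\t' -v cap="$1" '++seen[$1] <= cap + 0'
}

list_mixtures | cap_per_class 10
```

With these toy inputs the full grid has 18 rows (3 arrays x 2 NCCs x 3 proportions), and a cap of 10 per class keeps 16 of them, because only classX exceeds the cap.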