Published February 19, 2024 | Version 1.0
Dataset Restricted

CUPiD, A cfDNA methylation-based tissue-of-origin classifier for Cancers of Unknown Primary - classifier data and code

  • 1. Cancer Research UK National Biomarker Centre
  • 2. The Christie NHS Foundation Trust

Description

This repository holds the code behind the article "A cfDNA methylation-based tissue-of-origin classifier for Cancers of Unknown Primary" by Conway, Pearce, Clipson et al., published in Nature Communications. It contains the code and data required to generate the CUPiD classifier itself.

Data and code to reproduce the figures in the paper are available from https://zenodo.org/uploads/10684337 (unrestricted).

Methyl-Binding Domain protein sequencing (MBD-Seq) was applied to circulating cell-free DNA (cfDNA) samples derived from patients with a range of known cancer types (143 patients), as well as 106 non-cancerous controls (79 used in training). 

The objects deposited here include R data files containing qseaSets from the R package qsea. These hold the read counts per sample per 300 base pair window across the genome, as well as information on copy number variation and metadata tables. They are provided in the inputFiles/nextflowOutput folder, and are some of the outputs of the Nextflow pipeline.

The scripts folder contains numbered sub-folders with numbered scripts inside them, which should be run in order. The scripts are set up to run on a PBS-Torque system: files ending ".pbs" should be submitted via qsub, and files ending ".sh" should be run on a node, where they submit individual jobs within a loop. R scripts without an associated .pbs or .sh file should be run directly. All files should be submitted from the base of the repository (e.g. qsub scripts/01-downloadData/01-getRawData.pbs) so that paths are set correctly via the environment variable PBS_O_WORKDIR.
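
The dispatch rules above can be sketched as a small dry-run helper. This is an illustration only and is not part of the repository: the function names `launch_command` and `plan` are hypothetical, using `bash` and `Rscript` as launchers is an assumption, and treating a .pbs or .sh file with the same base name as the "associated" wrapper of an R script is also an assumption. The second path in the usage example is an invented file name. The helper only prints commands; it does not submit anything.

```python
from pathlib import Path


def launch_command(script, all_scripts):
    """Return the command for one script, or None if a wrapper launches it."""
    path = Path(script)
    existing = {Path(s) for s in all_scripts}
    if path.suffix == ".pbs":
        return ["qsub", path.as_posix()]       # submitted to the PBS-Torque queue
    if path.suffix == ".sh":
        return ["bash", path.as_posix()]       # assumption: run with bash on a node
    if path.suffix == ".R":
        # Assumption: a wrapper shares the R script's base name.
        if any(path.with_suffix(ext) in existing for ext in (".pbs", ".sh")):
            return None                        # the wrapper runs this script
        return ["Rscript", path.as_posix()]    # assumption: Rscript as launcher
    return None                                # unknown file type: skip


def plan(scripts):
    """Sort scripts by their numbered paths and list the commands to run."""
    ordered = sorted(scripts)
    commands = (launch_command(s, ordered) for s in ordered)
    return [c for c in commands if c is not None]


if __name__ == "__main__":
    example = [
        "scripts/01-downloadData/01-getRawData.pbs",  # path from the description
        "scripts/01-downloadData/02-example.R",       # hypothetical file name
    ]
    # Run from the base of the repository so relative paths resolve.
    for cmd in plan(example):
        print(" ".join(cmd))
```

Sorting the paths lexically relies on the zero-padded numeric prefixes of the folders and scripts, which matches the "run in order" convention described above.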

  • 01-downloadData contains scripts to download and preprocess all the required data.
  • 02-qseaSetNextFlowPipeline contains our custom in-house DSL2 Nextflow pipeline, which converts fastq files into processed qseaSets, including QC checks. This requires the fastq files, which will be deposited in EGA.
  • 03-convertArrays converts downloaded (pre-processed) arrays into estimated qseaSets (containing solely the array sample), and then mixes each array with each NCC cfDNA at varying proportions.
  • 04-DMRs calculates pairwise DMRs between each class.
  • 05-prepForClassifier selects up to 10000 mixture sets per class, and generates a large table suitable for input into the ML model.
  • 06-fitClassifier fits the ML model using xgboost within the tidymodels framework. This is repeated 100 times with different subsets of the mixture sets as input.
  • 07-applyClassifier applies these classifiers to the "independent test cohort" - the set of 143 patients with known tumour types, 27 additional NCCs and the 41 patients with CUP. These have been run through the Nextflow pipeline separately from the 79 NCCs used to derive CUPiD, and have not been used to derive the classifier.
  • 08-UMAPs generates some UMAPs on the array data.
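
Steps 03 and 05 above describe mixing each array-derived profile with each NCC cfDNA profile at varying proportions, then keeping up to 10000 mixture sets per class. The sketch below shows that idea on plain lists of per-window values. It is not the repository's R/qsea implementation: the linear blend, the function names, the example proportions and the random subsampling are all assumptions made for illustration.

```python
import random


def mix_profiles(tumour, ncc, proportion):
    """Linearly blend two per-window profiles; proportion is the tumour share."""
    if len(tumour) != len(ncc):
        raise ValueError("profiles must cover the same windows")
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return [proportion * t + (1.0 - proportion) * n for t, n in zip(tumour, ncc)]


def all_mixtures(arrays, nccs, proportions):
    """Mix every (class, profile) array with every NCC at every proportion."""
    mixtures = []
    for cls, tumour in arrays:
        for ncc in nccs:
            for p in proportions:
                mixtures.append((cls, p, mix_profiles(tumour, ncc, p)))
    return mixtures


def cap_per_class(mixtures, max_per_class=10000, seed=0):
    """Keep at most max_per_class mixtures per class (assumed random choice)."""
    rng = random.Random(seed)
    by_class = {}
    for item in mixtures:
        by_class.setdefault(item[0], []).append(item)
    kept = []
    for cls in sorted(by_class):
        items = by_class[cls]
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)
        kept.extend(items)
    return kept


if __name__ == "__main__":
    arrays = [("classA", [4.0, 0.0]), ("classB", [0.0, 4.0])]  # toy values
    nccs = [[0.0, 0.0], [2.0, 2.0]]                            # toy values
    mixes = all_mixtures(arrays, nccs, [0.25, 0.5, 1.0])
    print(len(mixes), len(cap_per_class(mixes, max_per_class=4)))
```

Changing the `seed` argument selects a different subset, which is one hypothetical way to obtain the varying inputs used when the fit in step 06 is repeated.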

A subset of these output files is provided in https://zenodo.org/uploads/10684337, along with the code to reproduce the figures.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

This data may only be used for academic use.

Please email DAC@cruk.manchester.ac.uk to request a Data Access Request form, which will need to be signed by an institutional representative, as well as potentially an International Data Transfer agreement. 

Requests sent only via the webform will not be granted.

If you get no response from the email address above, please follow up, as Zenodo does not send reminders about pending requests.
