Pathway-extended multigene expression signatures of chemotherapy responses to tyrosine kinase inhibitors: supporting data and program code
Authors/Creators
- 1. Western University
- 2. SHARCNET
- 3. Western University, CytoGnomix Inc.
Description
This Zenodo archive is associated with the research article, "Pathway-extended multigene expression signatures of chemotherapy responses to tyrosine kinase inhibitors," which has been submitted for publication. In particular, it provides the source code, compiled versions, and example inputs and outputs for several programs used in this paper. This archive also provides all database files used in this study (drug sensitivity data, gene expression and copy number data, etc.).
'MFAPreselection' is a novel software program that implements a network-based search of biochemical pathways to extend machine learning-based gene signatures derived from curated, peer-reviewed sources. It is a Haskell-based program designed to perform Multiple Factor Analysis (MFA) against biochemical data (e.g. gene expression and copy number data) of a pre-selected set of genes and drug sensitivity data from cell lines (e.g. GI50s) in order to identify those genes which show a direct or inverse correlation to drug response. The program performs "pathway extension", where genes which are biologically related to the initial gene set (those which passed the set MFA correlation thresholds) are also evaluated by MFA. This extension process can be repeated to evaluate those genes that are related to the first set of expanded genes, and so on.
When completed, the program will output a list of genes which passed the set MFA correlation angle threshold (i.e. the "angle cutoff"), provide the correlation angle between drug sensitivity and the gene expression / copy number of this gene, indicate which expansion step it was matched in, and briefly describe how this gene is related to (at least one of) the original pre-selected genes.
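The pathway extension described above can be pictured as a breadth-first expansion over a gene relation network. The Python sketch below is only an illustration and is not taken from the MFAPreselection source: `extend_gene_set`, `neighbours` and `passes_cutoff` are hypothetical names, the MFA test itself is replaced by a caller-supplied predicate, and it assumes that only genes which pass the cutoff are expanded further.

```python
def extend_gene_set(initial_genes, neighbours, passes_cutoff, steps_cutoff):
    """Return {gene: (step, gene_it_was_reached_from)} for every accepted gene.

    neighbours     -- dict mapping a gene to an iterable of related genes (assumed input)
    passes_cutoff  -- stand-in for the MFA correlation-angle test (hypothetical)
    steps_cutoff   -- maximum number of expansion steps from the initial set
    """
    accepted = {}
    seen = set(initial_genes)
    frontier = []
    # Step 0: evaluate the pre-selected genes themselves.
    for gene in initial_genes:
        if passes_cutoff(gene):
            accepted[gene] = (0, None)
            frontier.append(gene)
    # Steps 1..steps_cutoff: evaluate genes related to the previous step's hits.
    for step in range(1, steps_cutoff + 1):
        next_frontier = []
        for gene in frontier:
            for related in sorted(neighbours.get(gene, ())):
                if related in seen:
                    continue  # each gene is evaluated at most once
                seen.add(related)
                if passes_cutoff(related):
                    accepted[related] = (step, gene)
                    next_frontier.append(related)
        frontier = next_frontier
    return accepted


# Toy network with made-up gene names; "C" fails the stand-in test,
# so its neighbour "E" is never reached.
toy_network = {"A": {"B", "C"}, "B": {"D"}, "C": {"E"}}
toy_passing = {"A", "B", "D", "E"}
result = extend_gene_set(["A"], toy_network, lambda g: g in toy_passing, 2)
```

In this toy run each accepted gene carries the step at which it was matched and the gene it was reached from, mirroring the "step" and "relation" columns described for the program output.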
1. "MFAPreselection-and-Library-Files.zip"
This archive contains the compiled version of MFAPreselection, all required library files, and an example configuration file. MFAPreselection was designed to be run on SHARCNET, a network of multiple high performance Unix-based supercomputers. MFAPreselection is run by simply invoking the program ("./MFAPreselection"), which then reads instructions from the "config.txt" file which should be located in the same folder. This configuration file is tab-delimited and has the following structure:
drug Drugname
genesInitial GENE1 GENE2 ... (tab-delimited set of initial genes associated with drug)
aliasesFile ./Path-To-Data-Files/GeneNames-Association.Pseudonyms.txt
relationsFile ./Path-To-Data-Files/PathwayCommons.OneNode.InteractionFile.txt.sif
gi50sFile ./Path-To-Data-Files/GI50-Data.txt
copiesFile ./Path-To-Data-Files/CopyNumber-Data.txt
expressionsFile ./Path-To-Data-Files/GeneExpression-Data.txt
angleCutoff 10
stepsCutoff 2
circleOutput True
mfaInput False
mfaOutput True
svmInput True
aliasOutput True
Here, "angleCutoff" is the maximum MFA correlation angle for a gene to be considered correlated to GI50. "stepsCutoff" is the maximum number of relation steps from the initial gene set that the program will explore (e.g. '1' means MFAPreselection will look for genes related to your input gene set, '2' means it will also look for genes related to those genes found in '1'). "circleOutput" sets the program to generate MFA circle plots (this can add significant time to each run). "mfaInput" sets the program to create a file containing the GI50 / expression / copy number input data. "mfaOutput" generates a file called "MFA.tsv" which reports the correlation angle of GI50 to expression and copy number for all genes analyzed (it also provides the "step" of each gene and its relation to the initial gene set). "svmInput" generates files with GI50, gene expression and copy number data organized in a particular format for our machine learning programs. "aliasOutput" generates a file which reports all events where a gene alias was used.
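To make the layout of "config.txt" concrete, the following Python sketch reads a tab-delimited file of this shape into a dictionary. It is an illustration only and not how MFAPreselection itself (a Haskell program) parses its configuration: `parse_config` is a hypothetical helper, and the type conversions (booleans for the output switches, numbers for the cutoffs) are assumptions based on the example values shown above.

```python
import io

# Keys whose example values are the literal strings "True" / "False".
BOOLEAN_KEYS = {"circleOutput", "mfaInput", "mfaOutput", "svmInput", "aliasOutput"}


def parse_config(handle):
    """Parse tab-delimited lines: the first field is the key, the rest are values."""
    config = {}
    for raw_line in handle:
        fields = [f.strip() for f in raw_line.rstrip("\n").split("\t")]
        key, values = fields[0], [v for v in fields[1:] if v]
        if not key or not values:
            continue  # skip blank or incomplete lines
        if key == "genesInitial":
            config[key] = values              # one or more gene symbols
        elif key in BOOLEAN_KEYS:
            config[key] = values[0] == "True"
        elif key == "angleCutoff":
            config[key] = float(values[0])
        elif key == "stepsCutoff":
            config[key] = int(values[0])
        else:
            config[key] = values[0]           # drug name or file path
    return config


example_text = (
    "drug\tDrugname\n"
    "genesInitial\tGENE1\tGENE2\n"
    "angleCutoff\t10\n"
    "stepsCutoff\t2\n"
    "circleOutput\tTrue\n"
    "mfaInput\tFalse\n"
)
config = parse_config(io.StringIO(example_text))
```

A real file would be read with `open("config.txt")` in place of the in-memory example.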
2. "MFAPreselection-Data-Files.zip"
This folder contains all database files used in the study first describing 'MFAPreselection'. This includes drug sensitivity data (GI50s), gene expression and copy number data, gene pseudonym associations file, and the interactions file. A description of each file is given below:
"PathwayCommons.OneNode.InteractionFile.txt.sif"
This file contains associations for all genes from PathwayCommons. Two examples of interactions in the file:
"A1BG controls-expression-of A2M
A1BG interacts-with ABCC6"
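As an illustration of how interaction lines like the two above might be loaded, the Python sketch below builds a lookup from each gene to its related genes and the interaction type. `read_sif` is a hypothetical helper, not part of MFAPreselection; it assumes three whitespace-separated fields per line and, as a simplification, records each relation in both directions.

```python
from collections import defaultdict


def read_sif(lines):
    """Map each gene to {related_gene: interaction_type}."""
    neighbours = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        source, interaction, target = fields
        neighbours[source][target] = interaction
        # Assumption: relations are treated as symmetric for the search.
        neighbours[target][source] = interaction
    return neighbours


sif_lines = [
    "A1BG controls-expression-of A2M",
    "A1BG interacts-with ABCC6",
]
network = read_sif(sif_lines)
```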
"GeneNames-Association.Pseudonyms.txt"
A file (from genecards.org) which contains a list of official gene names (second column) and their older pseudonyms/aliases (ninth column; multiple aliases are pipe-delimited).
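The column layout described above can be turned into an alias lookup along the following lines. This Python sketch is illustrative: `read_aliases` is a hypothetical helper, the columns are assumed to be tab-separated, and the gene names in the example row are made up.

```python
def read_aliases(lines):
    """Map each alias (ninth column, pipe-delimited) to the official name (second column)."""
    alias_to_official = {}
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # row too short to contain an alias column
        official = columns[1].strip()
        for alias in columns[8].split("|"):
            alias = alias.strip()
            if alias:
                # Keep the first official name seen for an alias.
                alias_to_official.setdefault(alias, official)
    return alias_to_official


# Made-up row: official name in column 2, two aliases in column 9.
example_row = "\t".join(["1", "GENEA", "", "", "", "", "", "", "OLDNAME1|OLDNAME2"])
aliases = read_aliases([example_row])
```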
"All Gene Expression Data.txt" and "All Copy Number Data.txt"
These files contain all gene expression and copy number values computed by Daemen et al. (2013). Rows are genes, columns are the cell line names.
"GI50-Data.txt"
This file consists of a table with all GI50 values for all of the cell lines tested (from Daemen et al., 2013). Rows are the cell line names, and columns are the GI50 values. Cell lines without GI50s for a particular drug appear as 'N/A', and will be skipped by MFAPreselection.
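The handling of missing values described above can be sketched in Python as follows. This is an illustration rather than the program's own reader: `read_gi50` is a hypothetical helper, and it assumes a tab-delimited table with a header row in which the first column holds the cell line name and each remaining column holds the GI50 values for one drug. The cell line and drug names in the example are made up.

```python
import csv
import io


def read_gi50(handle, drug):
    """Return {cell_line: GI50} for one drug, skipping 'N/A' entries."""
    reader = csv.DictReader(handle, delimiter="\t")
    cell_line_column = reader.fieldnames[0]
    values = {}
    for row in reader:
        raw = (row.get(drug) or "").strip()
        if raw in ("", "N/A"):
            continue  # no measurement for this cell line
        values[row[cell_line_column]] = float(raw)
    return values


example_table = (
    "CellLine\tDrugA\tDrugB\n"
    "LINE1\t5.2\tN/A\n"
    "LINE2\tN/A\t6.1\n"
)
drug_a = read_gi50(io.StringIO(example_table), "DrugA")
```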
3. "MFAPreselection-Source-Code.zip"
This folder contains the source code (written in the Haskell programming language) for the program MFAPreselection. The 'Documentation' folder contains a README file (MS-Word) describing the program in added detail, and a diagram (pdf) showing how data flow is performed within the program.
4. "Automated-regularValidation_multiclassSVM-Job-Submitter-and-Data-Organizer.zip"
This folder contains multiple programs (written in Perl and MatLab) that were used to perform traditional validation of multiple PE high performance models (derived for an individual cancer drug) within a command-line environment. Contents include example input data files necessary to run these programs, as well as documentation that describes the function of each program, the input files they require, and the output they provide.
Please note that these programs were designed for the SHARCNET high-performance supercomputer, which uses the Slurm Workload Manager to handle job submissions. These programs may require some modifications to work on other types of systems.
5. "Ensemble-Averaging-of-Predictions-By-regularValidation_multiclassSVM.zip"
The provided Perl / MatLab hybrid program was written to perform Ensemble machine learning-based averaging of multiple PE high performance models derived for an individual cancer drug. The program requires the output from the model validation program "regularValidation_multiclassSVM.m", first described in Zhao et al., 2018 and made available in a separate Zenodo archive. One could also use the output from the validation programs provided in this archive (4. "Automated-regularValidation_multiclassSVM-Job-Submitter-and-Data-Organizer.zip"). This folder provides example input data files necessary to run the program, as well as the documentation file "Description-of-Ensemble-Averaging-Program.docx" which describes the program (including the contents of these required input files).
This program was also designed for the SHARCNET high-performance supercomputer, which uses the Slurm Workload Manager to handle job submissions.
Files
(873.0 MB)
| MD5 checksum | Size |
|---|---|
| c9b8dae3bedf52735c15eba7d1595661 | 119.4 kB |
| d3ad9bfc2fc13ae564e7974f3561bb19 | 133.4 kB |
| bf9bbe635f54d837a16dd5dcf9b71dec | 14.0 MB |
| de45289b9c053ef1148d86e1a6a8f028 | 858.6 MB |
| 1b3df7c7a05062c6809f84ec21f6732d | 153.1 kB |
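A downloaded archive can be compared against its listed MD5 checksum with a few lines of Python. The sketch below uses only the standard `hashlib` module; `md5_of_file` is a hypothetical helper name, and the caller supplies the local path of the downloaded file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can then be compared with the checksum listed for the corresponding file.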