A k-mer profiling workflow to generate personalized databases for mass spectrometry Céline M. Laumont April 17, 2018 1. Generation of the canonical cancer proteome 1.1. Quantification of transcript expression Run kallisto on trimmed RNA-seq reads: abundance.tsv 1.2. Identification of mutations Run STAR and freeBayes on trimmed RNA-seq reads: sampleName.vcf Format the resulting vcf file into a pyGeno agnostic file: sampleName.var.5X.pga 1.3. Creation of the pyGeno mutation datawrap Create the manifest.ini file for your set of mutations as described here: http://pygeno.iric.ca/datawraps.html. Run the following command line in your terminal to generate the associated datawrap: tar -cvzf .tar.gz manifest.ini .var.5X.pga 1.4. Creation of the canonical cancer proteome In the Personalized Proteome section of the capaMHC interface, import your pyGeno mutation datawrap using the Add SNPs set button. Using the Create personalized proteome button, create your sample’s canonical cancer proteome. You will be asked to import the relevant abundance.tsv file. Parameters used: -- tpm threshold: 0 -- all others: use default values Export the resulting fasta file: personalized_proteome_id_.fasta.txt Reformat this fasta file to be used in the k-mer profiling workflow using the following command line: python translation.py -n -pp -o The name of the final file will have the following format: sampleName_personalized_proteome_id_uniqID_reformat.fasta For help, use: python translation.py -h 2. Generation of the cancer-specific proteome 2.1. k-mer generation For both your Cancer and Normal samples, sequentially generate the k-mer database from raw RNA-seq reads using the following script: kmerGeneration.sh. General parameters to be specified: -- SAMPLE_NAME: name of the sample (ex: sampleName) -- PATH_TO_FASTQ: path to sequencing project (ex: ’https://bioinfo.iric.ca/seq/0d9a2dd2ce108338bd8273f34d9c452c/C98DJACXX/fastq/’) -- REP: path to sample/replicates if more than one, will be concatenated to the PATH_TO_FASTQ parameter (ex: ’Sample_PLC_12H018/’) -- NB_THREADS: number of processors to be used Trimmomatic-specific parameters: -- qLead: correspond to LEADING parameter (used Patrick’s default: 20) -- qTrail: corresponds to TRAILING parameter (used Patrick’s default: 20) -- minLen: corresponds to MINLEN parameter, decided to set it to longest k-mer size to be generated (i.e., 33 in our case) Jellyfish-specific parameters: -- STRANDED: specify the strandeness of the RNA-seq data (ex: ’TRUE’ or ’FALSE’) -- LEN: list of k-mer lengths (ex: (24 33)) -- HASH_SIZE: initial size of the hash table in jellyfish (ex: ’1G’, when working with transcriptomic data) -- OUTPATH: path to output directory After setting all parameters to the desired values, submit your task to the cluster via qsub (about 20Gb of memory, 8 ppn, should take between 1h and 4h...): module add torque qsub -V -l nodes=1:ppn=8,mem=20Gb,walltime=24:00:00 -m bea -M -N kmerGeneration.sh The resulting k-mer databases will be named as follows: sampleName.trim.R1rc.33.jf 2.2. k-mer filtering To extract cancer-specific k-mers, use the following script: kmerFiltering.sh Parameters to be specified: -- SAMPLE_NAME: name of considered sample (ex: sampleName) -- PATH_TO_JF: path to jf database of 33-nt-long cancer k-mers (ex: Cancer.trim.R1rc.33.jf) -- LCOUNT: minimal occurence for cancer k-mers to be considered expressed (≥), needs to be adjusted to get a maximum of 30M cancer-specific k-mers (ex: 10) -- PATH_TO_JF_FILTER: list of paths to jf database of 33-nt-long normal k-mers. k-mer filtering will be sequential (ex: (’Normal1.trim.R1rc.33.jf’ ’Normal2.trim.R1rc.33.jf’)) -- PATH_OUT: path to output directory Resulting files of cancer-specific k-mers will be named: * sampleName.10.allcount: list of all k-mers having an occurrence ≥ to LCOUNT and their associated count in all samples (cancer and normal(s)) * sampleName.10.Normal1.0.count: list of k-mers absent from the first normal database * sampleName.10.Normal1.0.Normal2.0.count : list of k-mers absent from the first and second normal databases * sampleName.10.filtering: number of k-mers after each filtering step Any file with the .count extension can be submitted to k-mer assembly. #### Verification needed we should strive for 10-30M kmers once the filtering is done % wc -l kmerFilt/*count 100160637 kmerFilt/OV633ascites.10.allcount 12039944 kmerFilt/OV633ascites.10.mtechumanAll.0.count 12M : good ! 2.3. kmer assembly Create the required tab-delimited project file (sampleName.10.Normal1.0.project): SAMPLE JF GROUP sampleName sampleName.trim.R1rc.33.jf Ref Perform k-mer assembly using the following script: kmerAssembly.sh Parameters to be specified: -- PROJECT_FILE: path to project file (ex: sampleName.10.Normal1.0.project) -- KMER_FILE: path to k-mer file (ex: sampleName.10.Normal1.0.count) -- KLEN: k-mer length (ex: 33) -- ASS_TYPE: specifiy the type of assembly to be performed. It is recommended to use the linear assembly as it is faster and more exhaustive than the graph assembly (ex: ’linear’ or ’graph’) -- CHECK_RC: specifiy if reverse complement sequences should be checked during k-mer assembly. Set this parameter to no when working with stranded RNA-seq data (ex: ’TRUE’ or ’FALSE’) -- PATH_OUT: path to output directory Both the assembly.tab and assembly.fasta files will be used in the in silico translation step. 2.4. in silico translation Translate your contigs (i.e., set of assembled k-mers) using the script translation.py. All keys, except the -pp key, should be used. For help, use: python translation.py -h Running with default parameters, the command line will look like this: % python ../scripts/translation.py -n -cf contigs//assembly.fasta -ct contigs//assembly.tab -o dbMS/ This should output the following file: sampleName_34_3-frame_8.fasta 3. Generation of the global cancer proteome Concatenate the reformated canonical cancer proteome and the cancer-specific proteome with the following command line: % cat sampleName_personalized_proteome_id_uniqID_reformat.fasta sampleName_34_3-frame_8.fasta > sampleName___PPuniqID.fasta, LCOUNT being the minimal occurence for cancer k-mers to be considered expressed (see Section kmer filtering). For each analyzed sample, submit to capaMHC the following fasta files: -- sampleName_personalized_proteome_id_uniqID_reformat.fasta (control database) -- sampleName___PPuniqID.fasta (concatenated database)