Classification of immune receptor repertoires using machine learning methods
- 1. University of Liverpool
- 2. University of Cambridge
Contributors
- 1. University of Cambridge
- 2. University of Oxford
- 3. Royal Surrey County
Description
R code to generate k-mer counts from a set of CDR3 sequences, and to classify samples based on these counts.
Contains 3 main functions:
NonPos_Matrix: A function to generate a matrix of non-positional k-mer counts from a matrix of CDR3 counts.
input: InputFile - CSV file containing a CDR3-by-sample matrix of CDR3 counts.
k - length of the k-mers to be counted
OutputFile - file to which the k-mer matrix is written; should have a .csv.gz extension
output: writes out the kmer matrix to specified OutputFile
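The counting step can be illustrated with a minimal sketch. This is not the repository's NonPos_Matrix; the helper name count_kmers, the toy matrix, and the choice to weight each k-mer by the count of the CDR3 it occurs in are all assumptions made for illustration.

```r
# Hypothetical helper (not the repository's NonPos_Matrix).
# cdr3_counts: numeric matrix, rows named by CDR3 sequence, columns = samples.
count_kmers <- function(cdr3_counts, k) {
  seqs <- rownames(cdr3_counts)
  kmers <- character(0)  # every k-mer occurrence found
  rows <- integer(0)     # index of the CDR3 row each occurrence came from
  for (i in seq_along(seqs)) {
    n <- nchar(seqs[i]) - k + 1
    if (n < 1) next      # sequences shorter than k contribute nothing
    starts <- seq_len(n)
    kmers <- c(kmers, substring(seqs[i], starts, starts + k - 1))
    rows <- c(rows, rep(i, n))
  }
  # Assumption: a k-mer's count in a sample is the sum of the counts of
  # the CDR3s containing it (once per occurrence).
  rowsum(cdr3_counts[rows, , drop = FALSE], group = kmers)
}

# Toy CDR3-by-sample matrix (made-up values)
cdr3 <- matrix(c(2, 0,
                 1, 3),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("CASS", "ASSL"), c("S1", "S2")))
kmer_mat <- count_kmers(cdr3, k = 3)

# Write a gzip-compressed CSV, matching the .csv.gz convention above
out_file <- tempfile(fileext = ".csv.gz")
write.csv(kmer_mat, gzfile(out_file))
```

Here the k-mer ASS occurs in both toy sequences, so its row sums their counts in each sample.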
Pos_Matrix: A function to generate a matrix of positional kmer counts from a matrix of CDR3 counts.
input: InputFile - CSV file containing a CDR3-by-sample matrix of CDR3 counts.
k - length of the k-mers to be counted
OutputFile - file to which the k-mer matrix is written; should have a .csv.gz extension
output: writes out the kmer matrix to specified OutputFile
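The description does not say how position is encoded. As one possible reading, the sketch below labels each k-mer with its start position in the CDR3, so the same k-mer at different positions is counted as a different feature. The helper name positional_kmers and the "KMER_position" label format are assumptions, not the repository's implementation.

```r
# Hypothetical helper: list the k-mers of one CDR3, each tagged with its
# 1-based start position (assumed encoding of "positional").
positional_kmers <- function(cdr3, k) {
  n <- nchar(cdr3) - k + 1
  if (n < 1) return(character(0))  # sequence shorter than k
  starts <- seq_len(n)
  paste(substring(cdr3, starts, starts + k - 1), starts, sep = "_")
}

pos_kmers <- positional_kmers("CASSL", 3)
print(pos_kmers)  # "CAS_1" "ASS_2" "SSL_3"
```

Counting these labels instead of the bare k-mers would give a positional k-mer matrix with the same shape of workflow as the non-positional case.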
ClusterOptim: A function which clusters a kmer count matrix based on its principal components, and identifies the set of principal components which generate the optimal clustering.
input: file_name - csv file which contains a kmer by sample matrix of kmer counts, of the type generated by NonPos_Matrix and Pos_Matrix
classes - a vector of 1s and 2s, in which each entry corresponds to a column (sample) of the input file and indicates whether the sample is a case (=1) or a control (=2)
plotAll=FALSE - logical; whether to plot the hierarchies of all PC combinations evaluated.
outDir='Clusters' - name of directory to which all output files are written.
output: writes out 3 files: a plot of the optimal hierarchy, the accuracy of all PC combinations evaluated, and a summary of the accuracy of all PC combinations evaluated. If plotAll=TRUE, also writes out plots of the hierarchies for all PC combinations evaluated.
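The general idea behind ClusterOptim can be sketched with base R functions. This is an illustration under stated assumptions, not the repository's code: the toy data, the restriction to pairs of the first four PCs, Euclidean distance, the default complete linkage, and the accuracy measure are all choices made here for the example.

```r
set.seed(1)

# Toy k-mer by sample matrix: 20 k-mers, 6 samples (made-up data)
kmer_mat <- matrix(rpois(120, lambda = 5), nrow = 20,
                   dimnames = list(paste0("kmer", 1:20), paste0("S", 1:6)))
kmer_mat[1:5, 1:3] <- kmer_mat[1:5, 1:3] + 20  # make cases differ from controls
classes <- c(1, 1, 1, 2, 2, 2)                 # 1 = case, 2 = control

# PCA with samples as observations, hence the transpose
pca <- prcomp(t(kmer_mat))

# Cluster samples on a chosen set of PCs, cut the tree into two groups,
# and score agreement with the known classes. Cluster labels are
# arbitrary, so take the better of the two possible labellings.
cluster_accuracy <- function(pcs) {
  hc <- hclust(dist(pca$x[, pcs, drop = FALSE]))
  cl <- cutree(hc, k = 2)
  acc <- mean(cl == classes)
  max(acc, 1 - acc)
}

# Evaluate every pair of the first four PCs (an assumed search space)
pc_sets <- combn(4, 2, simplify = FALSE)
acc <- vapply(pc_sets, cluster_accuracy, numeric(1))
best <- pc_sets[[which.max(acc)]]  # PC pair with the highest accuracy
```

The vector acc plays the role of the per-combination accuracy file described above, and plot(hclust(dist(pca$x[, best]))) would draw the hierarchy for the best-scoring pair.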